From skip at pobox.com Mon Jan 27 01:33:11 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 26 Jan 2003 18:33:11 -0600 Subject: DSVWizard.py In-Reply-To: <1043622397.25146.2910.camel@software1.logiplex.internal> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> Message-ID: <15924.32327.631412.57615@montanaro.dyndns.org> I'm adding Dave Cole to the distribution list on this note. Dave, Kevin Altis, Cliff Wells (author of DSV) and I have exchanged a few messages about trying to develop a CSV API for Python. >> I suspect most of the differences I see between the DSV and csv >> modules are due to interpretation differences between Cliff and Dave. Cliff> Or a bug in an older version of DSV. If you have anything that Cliff> differs using 1.4, please pass it on so I can take a look at it. I downloaded 1.4 just now. The sfsample.csv file is now processed identically by the two modules. The nastiness.csv file generates three differences though:

% python shootout.py nastiness.csv
DSV: 0.01 seconds, 13 rows
csv: 0.00 seconds, 13 rows
2
DSV: ['Test 1', 'Fred said "hey!", and left the room', '']
csv: ['Test 1', ' "Fred said ""hey!""', ' and left the room"', ' ""']
10
DSV: ['Test 9', 'no spaces around this', ' but single spaces around this ']
csv: ['Test 9', ' "no spaces around this" ', ' but single spaces around this ']
12
DSV: ['Test 11', 'has no spaces around anything', 'because the data is quoted']
csv: [' "Test 11" ', ' "has no spaces around anything" ', ' "because the data is quoted" ']

All three lines have whitespace immediately following the separating commas. DSV appears to skip over this whitespace, while csv treats it as part of the field contents. Skip PS, Just so Dave has the same "test harness", I've attached shootout.py and nastiness.csv. The shootout.py script now assumes DSV is installed with the package structure of DSV 1.4.0.
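For comparison, the csv module that later landed in Python's standard library exposes exactly this choice as the skipinitialspace flag. A minimal sketch of the two interpretations (the sample line here is invented, not taken from nastiness.csv):

```python
import csv
import io

line = "Test 9, no spaces around this, but single spaces around this \n"

# Default: whitespace after the delimiter is part of the field
# (the csv-module behaviour described above).
keep = next(csv.reader(io.StringIO(line)))
print(keep)  # ['Test 9', ' no spaces around this', ' but single spaces around this ']

# skipinitialspace=True drops it (the DSV-like behaviour).
skip = next(csv.reader(io.StringIO(line), skipinitialspace=True))
print(skip)  # ['Test 9', 'no spaces around this', 'but single spaces around this ']
```

Note that only whitespace immediately following the delimiter is affected; trailing whitespace is kept either way.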
-------------- next part -------------- A non-text attachment was scrubbed... Name: shootout.py Type: application/octet-stream Size: 730 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030126/a4de7492/attachment.obj -------------- next part -------------- A non-text attachment was scrubbed... Name: nastiness.csv Type: application/octet-stream Size: 600 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030126/a4de7492/attachment-0001.obj From skip at pobox.com Mon Jan 27 01:37:24 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 26 Jan 2003 18:37:24 -0600 Subject: DSVWizard.py In-Reply-To: <1043622397.25146.2910.camel@software1.logiplex.internal> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> Message-ID: <15924.32580.130562.578623@montanaro.dyndns.org> Cliff> I think even Excel has the option to import files using "/'/none Cliff> for text qualifiers. This was the only shortcoming I saw in csv Cliff> (only " is used for quoting). Actually, csv's parser objects have a writable quote_char attribute:

>>> import csv
>>> p = csv.parser()
>>> p.quote_char
'"'
>>> p.quote_char = "'"
>>> p.quote_char
"'"

Skip From LogiplexSoftware at earthlink.net Mon Jan 27 02:47:46 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 26 Jan 2003 17:47:46 -0800 Subject: DSVWizard.py In-Reply-To: <15924.32327.631412.57615@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> Message-ID: <1043632066.25146.2950.camel@software1.logiplex.internal> On Sun, 2003-01-26 at 16:33, Skip Montanaro wrote: > I'm adding Dave Cole to the distribution list on this note.
Dave, Kevin > Altis, Cliff Wells (author of DSV) and I have exchanged a few messages about > trying to develop a CSV API for Python. > > >> I suspect most of the differences I see between the DSV and csv > >> modules are due to interpretation differences between Cliff and Dave. > > Cliff> Or a bug in an older version of DSV. If you have anything that > Cliff> differs using 1.4, please pass it on so I can take a look at it. > > I downloaded 1.4 just now. The sfsample.csv file is now processed > identically by the two modules. The nastiness.csv file generates three > differences though: > > % python shootout.py nastiness.csv > DSV: 0.01 seconds, 13 rows > csv: 0.00 seconds, 13 rows > 2 > DSV: ['Test 1', 'Fred said "hey!", and left the room', ''] > csv: ['Test 1', ' "Fred said ""hey!""', ' and left the room"', ' ""'] IMO, Dave's is incorrect in this one (unless he has specific reasons otherwise). The original line (from the csv file) is: Test 1, "Fred said ""hey!"", and left the room", "" The "" at the end is an empty, quoted field. Maybe someone should run this through Excel to see what it claims (I'd be willing to accept Dave's interpretation if Excel does it this way, although I'd still feel it was incorrect). I handled this case specifically at a user's request. > 10 > DSV: ['Test 9', 'no spaces around this', ' but single spaces around this '] > csv: ['Test 9', ' "no spaces around this" ', ' but single spaces around this '] > 12 > DSV: ['Test 11', 'has no spaces around anything', 'because the data is quoted'] > csv: [' "Test 11" ', ' "has no spaces around anything" ', ' "because the data is quoted" '] > > All the three lines have white space immediately following separating > commas. DSV appears to skip over this white space, while csv treats it as > part of the field contents. Again, this was at a user's request, and is special-case code in DSV that can easily be removed. 
The user noted, and I concurred, that given a quoted field with whitespace around it, the whitespace should be ignored. However, once again I'd be willing to follow Excel's lead in this because I'd also consider this to be malformed or at least ambiguous data. > > Skip > > PS, Just so Dave has the same "test harness", I've attached shootout.py and > nastiness.csv. The shootout.py script now assumes DSV is installed with the > package structure of DSV 1.4.0. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From djc at object-craft.com.au Mon Jan 27 06:08:21 2003 From: djc at object-craft.com.au (Dave Cole) Date: 27 Jan 2003 16:08:21 +1100 Subject: DSVWizard.py In-Reply-To: <1043632066.25146.2950.camel@software1.logiplex.internal> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043632066.25146.2950.camel@software1.logiplex.internal> Message-ID: > On Sun, 2003-01-26 at 16:33, Skip Montanaro wrote: > > I'm adding Dave Cole to the distribution list on this note. Dave, > > Kevin Altis, Cliff Wells (author of DSV) and I have exchanged a > > few messages about trying to develop a CSV API for Python. > > > > >> I suspect most of the differences I see between the DSV and csv > > >> modules are due to interpretation differences between Cliff and Dave. > > > > Cliff> Or a bug in an older version of DSV. If you have anything that > > Cliff> differs using 1.4, please pass it on so I can take a look at it. > > > > I downloaded 1.4 just now. The sfsample.csv file is now processed > > identically by the two modules. 
The nastiness.csv file generates > > three differences though: > > > > % python shootout.py nastiness.csv > > DSV: 0.01 seconds, 13 rows > > csv: 0.00 seconds, 13 rows > > 2 > > DSV: ['Test 1', 'Fred said "hey!", and left the room', ''] > > csv: ['Test 1', ' "Fred said ""hey!""', ' and left the room"', ' ""'] > > IMO, Dave's is incorrect in this one (unless he has specific reasons > otherwise). Andrew (who has been included on the Cc) has tested the behaviour of Excel (such as it is) and we do the same thing as Excel. As to whether Excel is doing the right thing, that is a different question entirely. One of the people we have done work for has some very nasty "CSV" data to parse. We have been trying to work out what to do to the CSV module to handle some of the silly things he sees without breaking the Excel compatibility. > The original line (from the csv file) is: > > Test 1, "Fred said ""hey!"", and left the room", "" > > The "" at the end is an empty, quoted field. Maybe someone should > run this through Excel to see what it claims (I'd be willing to > accept Dave's interpretation if Excel does it this way, although I'd > still feel it was incorrect). I handled this case specifically at a > user's request. Andrew, can you run that exact line through Excel? > > 10 > > DSV: ['Test 9', 'no spaces around this', ' but single spaces around this '] > > csv: ['Test 9', ' "no spaces around this" ', ' but single spaces around this '] > > 12 > > DSV: ['Test 11', 'has no spaces around anything', 'because the data is quoted'] > > csv: [' "Test 11" ', ' "has no spaces around anything" ', ' "because the data is quoted" '] > > > > All the three lines have white space immediately following > > separating commas. DSV appears to skip over this white space, > > while csv treats it as part of the field contents. I am fairly sure that is what Excel does. > Again, this was at a user's request, and is special-case code in DSV > that can easily be removed.
The user noted, and I concurred, that > given a quoted field with whitespace around it, the whitespace > should be ignored. However, once again I'd be willing to follow > Excel's lead in this because I'd also consider this to be malformed > or at least ambiguous data. Pity there is no real specification for CSV. > > PS, Just so Dave has the same "test harness", I've attached > > shootout.py and nastiness.csv. The shootout.py script now assumes > > DSV is installed with the package structure of DSV 1.4.0. -- http://www.object-craft.com.au From djc at object-craft.com.au Mon Jan 27 06:13:34 2003 From: djc at object-craft.com.au (Dave Cole) Date: 27 Jan 2003 16:13:34 +1100 Subject: DSVWizard.py In-Reply-To: <15924.32580.130562.578623@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32580.130562.578623@montanaro.dyndns.org> Message-ID: > Cliff> I think even Excel has the option to import files using "/'/none > Cliff> for text qualifiers. This was the only shortcoming I saw in csv > Cliff> (only " is used for quoting). > > Actually, csv's parser objects have a writable quote_char attribute: > > >>> import csv > >>> p = csv.parser() > >>> p.quote_char > '"' > >>> p.quote_char = "'" > >>> p.quote_char > "'" For all sorts of fun and games you can even turn off quoting. 
>>> import csv
>>> p = csv.parser()
>>> p.join(['1','2,3','4'])
'1,"2,3",4'
>>> p.escape_char = '\\'
>>> p.join(['1','2,3','4'])
'1,"2,3",4'
>>> p.quote_char = None
>>> p.join(['1','2,3','4'])
'1,2\\,3,4'

- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Mon Jan 27 06:18:53 2003 From: djc at object-craft.com.au (Dave Cole) Date: 27 Jan 2003 16:18:53 +1100 Subject: DSVWizard.py In-Reply-To: <15924.32327.631412.57615@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> Message-ID: > I'm adding Dave Cole to the distribution list on this note. Dave, > Kevin Altis, Cliff Wells (author of DSV) and I have exchanged a few > messages about trying to develop a CSV API for Python. Python having a CSV API would be an excellent thing. The most difficult problem to solve is how to expose all of the CSV variations so that users can work out how to drive the module. I suppose the first step would be to catalogue all of the common CSV variations and give them names. Naming variations after the applications which produce them could be the best way. - Dave -- http://www.object-craft.com.au From LogiplexSoftware at earthlink.net Mon Jan 27 18:02:04 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 27 Jan 2003 09:02:04 -0800 Subject: DSVWizard.py In-Reply-To: References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> Message-ID: <1043686924.25146.2997.camel@software1.logiplex.internal> On Sun, 2003-01-26 at 21:18, Dave Cole wrote: > > I'm adding Dave Cole to the distribution list on this note.
Dave, > > Kevin Altis, Cliff Wells (author of DSV) and I have exchanged a few > > messages about trying to develop a CSV API for Python. > > Python having a CSV API would be an excellent thing. The most > difficult problem to solve is how to expose all of the CSV variations > so that users can work out how to drive the module. > > I suppose the first step would be to catalogue all of common the CSV > variations and give them names. Naming variations after the > applications which produce them could be the best way. That doesn't sound like a bad idea, but the task of cataloging all those applications seems a bit daunting, especially since I suspect between all of us we can probably only account for a handful of them. I suppose we could have a place for users to submit csv samples from applications they want supported. The fact of the matter is, despite there being no real standard, there seems to be only minor differences between each format: delimiter, quote style, allowed spaces around quotes. A programmer who knows the specific style of the data he's importing could specify via attributes or flags how to process the file. For the general case, DSV already has heuristics for determining the first two, and adding code to test for the third case shouldn't be too difficult. Another problem with specifying styles by application name is that many apps allow the user to specify portions of the style (usually the delimiter), so that's not set in stone either. I think what I'm leaning towards at this time, if everyone is in agreement, is for Dave or myself to reimplement Dave's code (and API) in Python so that there is a pure Python implementation, and then provide Dave's C module as a faster alternative (much like Pickle and cPickle). The heuristics of DSV would be an optional feature, along with the GUI. Someone is already doing work on porting the wxPython GUI code to Qt, but it would be useful for a Tk port to appear as well (I'm *not* volunteering for that). 
I also have serious doubts about the GUI getting added to the core (even a Tk version), so that would have to be spun off and maintained separately on SF. I also expect that if a csv module were added to the Python library, I could get Robin Dunn to add the GUI for it to the wxPython libraries. As far as DSV's current API, I'm not too attached to it, and I think that it could be mimicked sufficiently by adding a parser.parseall() method to Dave's API so the programmer would have the option of getting the entire file as a list without having to write a loop. Something I'd also like to see, and I think Kevin mentioned this, is a generator interface for retrieving the data line by line. I think that this would provide the most complete set of features and best performance options. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Mon Jan 27 18:36:26 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 11:36:26 -0600 Subject: DSVWizard.py In-Reply-To: <1043686924.25146.2997.camel@software1.logiplex.internal> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> Message-ID: <15925.28186.949208.952742@montanaro.dyndns.org> (Dave, should we continue to use the csv at object-craft address for you or your djc email?) >> I suppose the first step would be to catalogue all of common the CSV >> variations and give them names. Naming variations after the >> applications which produce them could be the best way. Cliff> That doesn't sound like a bad idea, but the task of cataloging Cliff> all those applications seems a bit daunting, especially since I Cliff> suspect between all of us we can probably only account for a Cliff> handful of them. 
I think we should aim for Excel2000 compatibility as a bare minimum, and at least document any supported extensions and try to tie them to specific other applications. It is indeed unfortunate that the CSV file format is only operationally defined. Wild-ass idea: Maybe the API should include a query function or a data attribute which lists (as strings) the variants of CSV supported by a module (which should be supported by test cases)? The default variant would be listed first, and the constructor would take any of the listed variants as an optional argument. Something like: variants = csv.get_variants() csvl = csv.parser(variant="lotus123") csve = csv.parser(variant="excel2000") We could create an informal "registry" of valid variant names. If support for an existing variant is added, you use that name. If support for an unknown variant is added, you register a string. Cliff> ... despite there being no real standard, there seems to be only Cliff> minor differences between each format: delimiter, quote style, Cliff> allowed spaces around quotes. That's true. Perhaps selecting by variant name would do nothing more than set those specific values behind the scenes, much the same way that when you choose a particular C coding style in Emacs a number of low-level variable values are set. Cliff> Another problem with specifying styles by application name is Cliff> that many apps allow the user to specify portions of the style Cliff> (usually the delimiter), so that's not set in stone either. Yes, but there's still usually a default. Some of the stuff (like space after delimiters, newlines inside fields or CRLF/LF/CR line endings) isn't user-settable and isn't obvious without inspecting the CSV file. You might have csve2 = csv.parser(variant="excel2000", delimiter=';') to specify user-settable parameters or use "sniffing" code like DSV does to figure out what the best choice is. 
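A sketch of how such a variant registry plus per-call overrides might look (the variant names here are from the discussion, but the settings attached to them are illustrative placeholders, not a proposed final list):

```python
# Hypothetical variant registry: each name maps to a set of parser settings.
# Default variant listed first.
VARIANTS = {
    "excel2000": {"delimiter": ",", "quote_char": '"', "skip_space": False},
    "lotus123":  {"delimiter": ",", "quote_char": '"', "skip_space": True},
}

def get_variants():
    """Return the supported variant names, default first."""
    return list(VARIANTS)

def parser(variant="excel2000", **overrides):
    """Combine a named variant with user-settable overrides."""
    settings = dict(VARIANTS[variant])
    settings.update(overrides)
    # A real module would configure and return a parser object here;
    # returning the settings dict keeps the sketch self-contained.
    return settings

csve2 = parser(variant="excel2000", delimiter=";")
print(csve2["delimiter"])  # ;
```

Selecting a variant then does nothing more than set the low-level values behind the scenes, and any explicitly passed parameter wins over the variant's default.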
Cliff> I think what I'm leaning towards at this time, if everyone is in Cliff> agreement, is for Dave or myself to reimplement Dave's code (and Cliff> API) in Python so that there is a pure Python implementation, and Cliff> then provide Dave's C module as a faster alternative (much like Cliff> Pickle and cPickle). The heuristics of DSV would be an optional Cliff> feature, along with the GUI. This sounds like a reasonable idea. I also agree the GUI stuff will probably not make it into the core. Cliff> As far as DSV's current API, I'm not too attached to it, and I Cliff> think that it could be mimicked sufficiently by adding a Cliff> parser.parseall() method to Dave's API so the programmer would Cliff> have the option of getting the entire file as a list without Cliff> having to write a loop. Skip From skip at pobox.com Mon Jan 27 19:13:02 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 12:13:02 -0600 Subject: ASV Message-ID: <15925.30382.921990.566934@montanaro.dyndns.org> I downloaded and installed Laurie Tratt's ASV module today and extended my shootout script to try it. It's considerably slower than DSV (by about 15x on my sfsample.csv file, which makes it something like 75-150x slower than csv) and doesn't appear to handle newlines within fields, generating 17 rows instead of 13 on nastiness.csv. It also seems to ignore all whitespace at the beginning of fields, regardless of field quoting, so for the first line of nastiness.csv it returns

['Column1', 'Column2', 'Column3']

instead of

['Column1', 'Column2', ' Column3']

It does generate the same results as DSV and csv for my sfsample.csv file, though that file is very well-behaved (fully quoted, no whitespace surrounding delimiters). I'm not aware that it has any interesting properties not available in either DSV or csv, so I'm inclined to not consider it further.
Skip From skip at pobox.com Mon Jan 27 19:17:06 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 12:17:06 -0600 Subject: delimiters... Message-ID: <15925.30626.901414.610449@montanaro.dyndns.org> I modified shootout.py to allow specification of alternate delimiters on the command line and manually converted nastiness.csv to nastytabs.csv. Processing nastytabs.csv with TAB as the delimiter generates identical results as processing nastiness.csv with comma as the delimiter. (This is a good thing. ;-) Nastytabs.csv and modified shootout.py attached. Skip -------------- next part -------------- A non-text attachment was scrubbed... Name: nastytabs.csv Type: application/octet-stream Size: 600 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030127/1d2e6d32/attachment.obj -------------- next part -------------- A non-text attachment was scrubbed... Name: shootout.py Type: application/octet-stream Size: 1083 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030127/1d2e6d32/attachment-0001.obj From LogiplexSoftware at earthlink.net Mon Jan 27 20:42:22 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 27 Jan 2003 11:42:22 -0800 Subject: DSVWizard.py In-Reply-To: <15925.28186.949208.952742@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> Message-ID: <1043696542.25139.3027.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 09:36, Skip Montanaro wrote: > (Dave, should we continue to use the csv at object-craft address for you or > your djc email?) > > >> I suppose the first step would be to catalogue all of common the CSV > >> variations and give them names. 
Naming variations after the > >> applications which produce them could be the best way. > > Cliff> That doesn't sound like a bad idea, but the task of cataloging > Cliff> all those applications seems a bit daunting, especially since I > Cliff> suspect between all of us we can probably only account for a > Cliff> handful of them. > > I think we should aim for Excel2000 compatibility as a bare minimum, and at > least document any supported extensions and try to tie them to specific > other applications. It is indeed unfortunate that the CSV file format is > only operationally defined. > > Wild-ass idea: Maybe the API should include a query function or a data > attribute which lists (as strings) the variants of CSV supported by a module > (which should be supported by test cases)? The default variant would be > listed first, and the constructor would take any of the listed variants as > an optional argument. Something like: > > variants = csv.get_variants() > > csvl = csv.parser(variant="lotus123") > csve = csv.parser(variant="excel2000") > > We could create an informal "registry" of valid variant names. If support > for an existing variant is added, you use that name. If support for an > unknown variant is added, you register a string. Sounds reasonable, but I think the variant should be customizable in the method call: csvl = csv.parser(variant = "lotus123", delimiter = '\t') So assuming "lotus123" was defined to use commas by default, it would follow all the rules of the lotus variant except for the delimiter. This would allow for some flexibility in case the user saved the csv file from Lotus but changed an option or two. > Cliff> ... despite there being no real standard, there seems to be only > Cliff> minor differences between each format: delimiter, quote style, > Cliff> allowed spaces around quotes. > > That's true. 
Perhaps selecting by variant name would do nothing more than > set those specific values behind the scenes, much the same way that when you > choose a particular C coding style in Emacs a number of low-level variable > values are set. That's what I was thinking. In this case the "variant" could just be a dictionary or simple class with a few attributes. > Cliff> Another problem with specifying styles by application name is > Cliff> that many apps allow the user to specify portions of the style > Cliff> (usually the delimiter), so that's not set in stone either. > > Yes, but there's still usually a default. Some of the stuff (like space > after delimiters, newlines inside fields or CRLF/LF/CR line endings) isn't > user-settable and isn't obvious without inspecting the CSV file. You might > have > > csve2 = csv.parser(variant="excel2000", delimiter=';') Oh. Guess I should have read the entire message before replying ;) At least it looks like we are on the same page =) > to specify user-settable parameters or use "sniffing" code like DSV does to > figure out what the best choice is. The "sniffing" code in DSV is best used in conjunction with some sort of confirmation from the user. I've seen it guess incorrectly on some files (although not very often). Mostly stuff that has repeating patterns of other characters (colons and slashes in dates and times). However, given these types of files, it defaults to the more common delimiter (i.e. given a file that has both repeating colons and commas, the comma will be chosen) which weeds out the majority of false positives. Nevertheless, it would seem foolhardy for a programmer to rely on it without some sort of user intervention. It could be perhaps made a little smarter, but it's a difficult problem and I'd be reluctant to use it alone. This is why the GUI code is rather part-and-parcel with the heuristics. 
Nevertheless, having a separate project for maintaining the GUI solves this and the programmer can always roll his own if need be. > Cliff> I think what I'm leaning towards at this time, if everyone is in > Cliff> agreement, is for Dave or myself to reimplement Dave's code (and > Cliff> API) in Python so that there is a pure Python implementation, and > Cliff> then provide Dave's C module as a faster alternative (much like > Cliff> Pickle and cPickle). The heuristics of DSV would be an optional > Cliff> feature, along with the GUI. > > This sounds like a reasonable idea. I also agree the GUI stuff will > probably not make it into the core. Anyone else? BTW, where are we planning on hosting this project? Under one of the existing projects or somewhere else? -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Mon Jan 27 20:48:13 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 27 Jan 2003 11:48:13 -0800 Subject: ASV In-Reply-To: <15925.30382.921990.566934@montanaro.dyndns.org> References: <15925.30382.921990.566934@montanaro.dyndns.org> Message-ID: <1043696893.25146.3034.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 10:13, Skip Montanaro wrote: > I downloaded and installed Laurie Tratt's ASV module today and extended my > shootout script to try it. It's considerably slower than DSV (by about 15x > on my sfsample.csv file, which makes it something like 75-150x slower than > csv) and doesn't appear to handle newlines within fields, generating 17 rows > instead of 13 on nastiness.csv. 
It also seems to ignore all whitespace at > the beginning of fields, irregardless of field quoting, so for the first > line of nastiness.csv it returns > > ['Column1', 'Column2', 'Column3'] > > instead of > > ['Column1', 'Column2', ' Column3'] > > It does generate the same results as DSV and csv for my sfsample.csv script, > though that file is very well-behaved (fully quoted, no whitespace > surrounding delimiters). > > I'm not aware that it has any interesting properties not available in either > DSV or csv, so I'm inclined to not consider it further. Agreed. I assume the API didn't provide any interesting approaches either? -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Mon Jan 27 21:05:49 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 27 Jan 2003 12:05:49 -0800 Subject: DSVWizard.py In-Reply-To: References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043632066.25146.2950.camel@software1.logiplex.internal> Message-ID: <1043697949.25139.3051.camel@software1.logiplex.internal> On Sun, 2003-01-26 at 21:08, Dave Cole wrote: > > > DSV: ['Test 1', 'Fred said "hey!", and left the room', ''] > > > csv: ['Test 1', ' "Fred said ""hey!""', ' and left the room"', ' ""'] > > > > IMO, Dave's is incorrect in this one (unless he has specific reasons > > otherwise). > > Andrew (who has been included on th Cc) has tested the behaviour of > Excel (such as it is) and we do the same thing as Excel. As to > whether Excel is doing the right thing, that is a different question > entirely. Okay. So the default behavior would be to *not* treat the quotes as text qualifiers in the following: data, "data", data unless the user specifies otherwise. 
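That default can be sketched with the csv module that later shipped in Python's standard library, which keeps the same Excel-style rule: a quote that does not open a field is ordinary data, unless whitespace after the delimiter is skipped first (the sample line is the data, "data", data case under discussion):

```python
import csv
import io

line = 'data, "data", data'

# Default: the quote appears after a space, so it does not open the
# field and is kept as literal field data.
literal = next(csv.reader(io.StringIO(line)))
print(literal)    # ['data', ' "data"', ' data']

# With the leading space skipped, the quote becomes a text qualifier.
qualified = next(csv.reader(io.StringIO(line), skipinitialspace=True))
print(qualified)  # ['data', 'data', 'data']
```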
> One of the people we have done work for has some very nasty "CSV" data > to parse. We have been trying to work out what to do to the CSV > module to handle some of the silly things he sees without breaking the > Excel compatibility. Having "variants" as Skip mentioned (and I think you did as well) would solve this. I'm also a bit curious as to the "Treat consecutive delimiters as one" option in Excel. I had planned to add support for that in DSV but never got around to it. Does csv have such an option? Is this really ever useful? I've never had anyone request that I enable that option in DSV, despite the fact that there's even a checkbox (disabled) for it in the GUI. > > > The original line (from the csv file) is: > > > > Test 1, "Fred said ""hey!"", and left the room", "" > > > > The "" at the end is an empty, quoted field. Maybe someone should > > run this through Excel to see what it claims (I'd be willing to > > accept Dave's interpretation if Excel does it this way, although I'd > > still feel it was incorrect). I handled this case specifically at a > > user's request. > > Andrew, can you run that exact line through Excel? > > > > 10 > > > DSV: ['Test 9', 'no spaces around this', ' but single spaces around this '] > > > csv: ['Test 9', ' "no spaces around this" ', ' but single spaces around this '] > > > 12 > > > DSV: ['Test 11', 'has no spaces around anything', 'because the data is quoted'] > > > csv: [' "Test 11" ', ' "has no spaces around anything" ', ' "because the data is quoted" '] > > > > > > All the three lines have white space immediately following > > > separating commas. DSV appears to skip over this white space, > > > while csv treats it as part of the field contents. > > I am fairly sure that is what Excel does. You're probably correct, but I'd like to be 100% certain on this. > Pity there is no real specification for CSV. Actually, it's only the V part of CSV that's poorly defined. Maybe CSV should stand for "comma separated vagueness".
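As for what "treat consecutive delimiters as one" would mean in practice, a toy sketch of the semantics (this is not how Excel or either module implements it, and it ignores quoting entirely, which a real parser could not):

```python
import re

def split_collapsed(line, delim=","):
    """Split a line, treating any run of consecutive delimiters as one.

    Toy sketch only: quoted fields containing the delimiter would be
    mangled, so a real implementation must honour quoting first.
    """
    return re.split(re.escape(delim) + "+", line.rstrip("\n"))

print(split_collapsed("a,,b,,,c"))      # ['a', 'b', 'c']
print(split_collapsed("a\t\tb", "\t"))  # ['a', 'b']
```

The option mainly makes sense for whitespace-aligned exports, where runs of tabs or spaces pad columns; for genuine CSV it silently drops empty fields.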
Speaking of names, since Kevin is correct in that people will look for CSV since that's the common term, we could just define C to stand for "character" rather than "comma", since this will be a general-purpose importer. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Mon Jan 27 22:02:23 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 15:02:23 -0600 Subject: ASV In-Reply-To: <1043696893.25146.3034.camel@software1.logiplex.internal> References: <15925.30382.921990.566934@montanaro.dyndns.org> <1043696893.25146.3034.camel@software1.logiplex.internal> Message-ID: <15925.40543.186264.281135@montanaro.dyndns.org> >> I'm not aware that it has any interesting properties not available in >> either DSV or csv, so I'm inclined to not consider it further. Cliff> Agreed. I assume the API didn't provide any interesting Cliff> approaches either? Not really. In fact, I found it a bit confusing. I couldn't figure out how to specify an alternate delimiter either. For some reason it appears Emacs didn't save any intermediate backups of my shootout script, so I can't cut-n-paste what I did use and am not going to fumble around to reproduce it at this point. Skip From djc at object-craft.com.au Tue Jan 28 00:22:28 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 10:22:28 +1100 Subject: DSVWizard.py In-Reply-To: <15925.28186.949208.952742@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> (Dave, should we continue to use the csv at object-craft address Skip> for you or your djc email?) 
Use the csv at object-craft.com.au address as it will ensure that Andrew gets messages as well. Andrew has spent considerable effort making the CSV module conform to Excel behaviour. Skip> I think we should aim for Excel2000 compatibility as a bare Skip> minimum, and at least document any supported extensions and try Skip> to tie them to specific other applications. It is indeed Skip> unfortunate that the CSV file format is only operationally Skip> defined. Skip> Wild-ass idea: Maybe the API should include a query function or Skip> a data attribute which lists (as strings) the variants of CSV Skip> supported by a module (which should be supported by test cases)? Skip> The default variant would be listed first, and the constructor Skip> would take any of the listed variants as an optional argument. Skip> Something like: Skip> variants = csv.get_variants() Skip> csvl = csv.parser(variant="lotus123") Skip> csve = csv.parser(variant="excel2000") What I think we should do is implement two layers; a Python layer and an extension module. The extension module should contain only the functions which are necessary to implement a fast parser. The Python layer would be the registry of variants and would configure and tweak the parser. This would allow all tweaking intelligence to be hidden from the user while keeping implementation details out of the parser. Skip> We could create an informal "registry" of valid variant names. Skip> If support for an existing variant is added, you use that name. Skip> If support for an unknown variant is added, you register a Skip> string. I suppose a torture test is the first step in defining the variants. Instead of trying to formally specify the variants up front we could define them by the way they process the torture test. Skip> That's true. 
Perhaps selecting by variant name would do nothing Skip> more than set those specific values behind the scenes, much the Skip> same way that when you choose a particular C coding style in Skip> Emacs a number of low-level variable values are set. My thoughts exactly. Cliff> Another problem with specifying styles by application name is Cliff> that many apps allow the user to specify portions of the style Cliff> (usually the delimiter), so that's not set in stone either. In the first instance we have to assume that people are going to choose styles which are not ambiguous. This is a big assumption - I have seen applications (database bulkcopy tools) which happily allow you to export data which cannot be unambiguously parsed back into the original fields/columns. Cliff> I think what I'm leaning towards at this time, if everyone is Cliff> in agreement, is for Dave or myself to reimplement Dave's code Cliff> (and API) in Python so that there is a pure Python Cliff> implementation, and then provide Dave's C module as a faster Cliff> alternative (much like Pickle and cPickle). The heuristics of Cliff> DSV would be an optional feature, along with the GUI. Shouldn't we first come up with a project plan. If the eventual goal is to get this into Python we are going to have to write a PEP. Rather than trying to do everything ourselves we should try to think of a method whereby we will get people to run a torture test against the applications they need to interact with. The steps would include (not sure about the order): * Develop CSV torture test. * Develop format by which people can submit results of torture test which will allow us to eventually regression test the parser against those results. * Define Python API for CSV parser. * Define extension module API. * Write PEP. * Develop CSV module. Skip> This sounds like a reasonable idea. I also agree the GUI stuff Skip> will probably not make it into the core. I agree. 
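Skip's idea that a variant name would "do nothing more than set those specific values behind the scenes" could be sketched as a preset table in the Python layer. All names and preset values below are hypothetical, chosen only to show the shape of the registry:

```python
# Hypothetical variant registry: a name is just a bundle of low-level
# parser settings, which the caller may override individually.
VARIANTS = {
    'excel2000': {'delimiter': ',', 'quotechar': '"', 'skipinitialspace': False},
    'lotus123':  {'delimiter': ',', 'quotechar': '"', 'skipinitialspace': True},
}

def parser_settings(variant='excel2000', **overrides):
    settings = dict(VARIANTS[variant])  # copy the preset
    settings.update(overrides)          # per-call tweaks win
    return settings

parser_settings('excel2000', delimiter='/')
# -> {'delimiter': '/', 'quotechar': '"', 'skipinitialspace': False}
```

Because the registry lives in pure Python, new variants (or new tweakables with sensible defaults) can be added without touching the extension module.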
Cliff> As far as DSV's current API, I'm not too attached to it, and I Cliff> think that it could be mimicked sufficiently by adding a Cliff> parser.parseall() method to Dave's API so the programmer would Cliff> have the option of getting the entire file as a list without Cliff> having to write a loop. I think that we should be prepared to go back to the drawing board on the API if necessary. Once we have enough variants registered we will be in a better position to come up with the "right" API. - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Tue Jan 28 00:25:00 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Tue, 28 Jan 2003 10:25:00 +1100 Subject: DSVWizard.py In-Reply-To: Message from Dave Cole of "27 Jan 2003 16:08:21 +1100." References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043632066.25146.2950.camel@software1.logiplex.internal> Message-ID: >> > DSV: ['Test 1', 'Fred said "hey!", and left the room', ''] >> > csv: ['Test 1', ' "Fred said ""hey!""', ' and left the room"', ' ""'] >> >> IMO, Dave's is incorrect in this one (unless he has specific reasons >> otherwise). > >Andrew (who has been included on the Cc) has tested the behaviour of >Excel (such as it is) and we do the same thing as Excel. As to >whether Excel is doing the right thing, that is a different question >entirely. [...] >> The original line (from the csv file) is: >> >> Test 1, "Fred said ""hey!"", and left the room", "" Excel (at least, Excel 97) only gives the quote character a special meaning when it appears directly after the field separator.
In this example, you have a space between the comma and the quote - removing the space, CSV gives you: ['Test 1', 'Fred said "hey!", and left the room', ''] Older versions of CSV, in fact, behaved as DSV does (since that makes more sense), but in the name of Excel compatibility... >> The "" at the end is an empty, quoted field. Maybe someone should >> run this through Excel to see what it claims (I'd be willing to >> accept Dave's interpretation if Excel does it this way, although I'd >> still feel it was incorrect). I handled this case specifically at a >> user's request. > >Andrew, can you run that exact line through Excel? Excel and CSV are behaving the same way on this line. As I mention above, the space after the field separator is the problem. I probably should add a "gobble leading space option" (sigh). >> > All the three lines have white space immediately following >> > separating commas. DSV appears to skip over this white space, >> > while csv treats it as part of the field contents. > >I am fairly sure that is what Excel does. Indeed. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Tue Jan 28 00:25:22 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 10:25:22 +1100 Subject: DSVWizard.py In-Reply-To: <1043696542.25139.3027.camel@software1.logiplex.internal> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> <1043696542.25139.3027.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: >> This sounds like a reasonable idea. I also agree the GUI stuff >> will probably not make it into the core. Cliff> Anyone else? BTW, where are we planning on hosting this Cliff> project? 
Under one of the existing projects or somewhere else? If we are trying to get this into Python shouldn't we use something like sourceforge. Has anyone been through the process of getting code into Python before? - Dave -- http://www.object-craft.com.au From altis at semi-retired.com Tue Jan 28 00:35:45 2003 From: altis at semi-retired.com (Kevin Altis) Date: Mon, 27 Jan 2003 15:35:45 -0800 Subject: DSVWizard.py In-Reply-To: Message-ID: > From: Dave Cole > > >>>>> "Cliff" == Cliff Wells writes: > > >> This sounds like a reasonable idea. I also agree the GUI stuff > >> will probably not make it into the core. > > Cliff> Anyone else? BTW, where are we planning on hosting this > Cliff> project? Under one of the existing projects or somewhere else? > > If we are trying to get this into Python shouldn't we use something > like sourceforge. Has anyone been through the process of getting code > into Python before? Either just use the Python DSV project Cliff already has setup http://sourceforge.net/projects/python-dsv or create a new one python-csv Either way, everyone should have write privs. and a new cvs dir needs to be created to hold the working code. Originally, I thought the task of making a standard module was going to be relatively trivial, but I'm guessing now that there will be enough effort required in deciding on the API, test cases, a PEP, etc. that it won't be appropriate to try and make it part of Python 2.3, but will have to wait for Python 2.4 instead. So, in the meantime, the project will just follow the lead of other projects prior to being incorporated in the Python core. Skip has the most experience in this area, do you agree with the assessment above Skip? 
Public discussions can take place on the db-sig and/or c.l.py ka From djc at object-craft.com.au Tue Jan 28 00:35:55 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 10:35:55 +1100 Subject: DSVWizard.py In-Reply-To: <1043697949.25139.3051.camel@software1.logiplex.internal> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043632066.25146.2950.camel@software1.logiplex.internal> <1043697949.25139.3051.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> Okay. So the default behavior would be to *not* treat the Cliff> quotes as text qualifiers in the following: Cliff> data, "data", data Cliff> unless the user specifies otherwise. I believe that is how Excel works. >> One of the people we have done work for has some very nasty "CSV" >> data to parse. We have been trying to work out what to do to the >> CSV module to handle some of the silly things he sees without >> breaking the Excel compatibility. Cliff> Having "variants" as Skip mentioned (and I think you did as Cliff> well) would solve this. Cliff> I'm also a bit curious as to the "Treat consecutive delimiters Cliff> as one" option in Excel. I had planned to add support for that Cliff> in DSV but never got around to it. Does csv have such an Cliff> option? Is this really ever useful? I've never had anyone Cliff> request that I enable that option in DSV, despite the fact that Cliff> there's even a checkbox (disabled) for it in the GUI. I suppose there is no reason why we could not allow people to invoke variants like this; p = csv.parser(app='Excel', consecutive_delimiters=1) The API could be as simple as def parser(**kwargs): app = kwargs.get('app', 'Excel') Cliff> Actually, it's only the V part of CSV that's poorly defined Cliff> . Maybe CSV should stand for "comma separated Cliff> vagueness". 
LOL. Cliff> Speaking of names, since Kevin is correct in that people will Cliff> look for CSV since that's the common term, we could just define Cliff> C to stand for "character" rather than "comma", since this will Cliff> be a general-purpose importer. Or use both. As long as you use include "comma separated values" and "character separated values" google will find it. - Dave -- http://www.object-craft.com.au From skip at pobox.com Tue Jan 28 00:59:42 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 17:59:42 -0600 Subject: DSVWizard.py In-Reply-To: References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> Message-ID: <15925.51182.467769.765511@montanaro.dyndns.org> Dave> Shouldn't we first come up with a project plan. If the eventual Dave> goal is to get this into Python we are going to have to write a Dave> PEP. I'm working on a PEP... ;-) This thread is all good grist for the mill. I'll try to get something minimal you can throw tomatoes at tonight or tomorrow. Dave> * Define Python API for CSV parser. Dave> * Define extension module API. I'm not sure you need to define an extension module API. I view the extension module is essentially an implementation detail. Cliff> As far as DSV's current API, I'm not too attached to it, and I Cliff> think that it could be mimicked sufficiently by adding a Cliff> parser.parseall() method to Dave's API so the programmer would Cliff> have the option of getting the entire file as a list without Cliff> having to write a loop. Dave> I think that we should be prepared to go back to the drawing board Dave> on the API if necessary. Once we have enough variants registered Dave> we will be in a better position to come up with the "right" API. Hmmm... 
I'd like to get something into 2.3 without a wholesale rewrite if possible. I see two basic operations: * suck the contents of a file-like object opened for reading into a list of lists (or iterable returning lists) * write a list of lists to to a file-like object opened for writing I view the rest of the API as essentially just tweaks to the formatting parameters. I think Dave's csv module (should I be calling it Object Craft's csv module? I don't mean to slight other contributors) is fairly close to this already, though it would be nice to be able to read a CSV file like so: import csv csvreader = csv.parser(file("nastiness.csv")) # csvreader.setparams(dialect="excel2000", quote='"', delimiter='/') for row in csvreader: process(row) and write it like so: import csv csvwriter = csv.writer(file("newnastiness.csv", "w")) # csvwriter.setparams(dialect="lotus123", quote='"', delimiter='/') for row in someiterable: csvwriter.write(row) The .setparams() method can obviously be collapsed into the constructors. I could thus implement a CSV dialect converter (do others like "dialect" better than "variant"?) thus: import csv csvreader = csv.parser(file("nastiness.csv"), dialect="excel2000") csvwriter = csv.writer(file("newnastiness.csv", "w"), dialect="lotus123", delimiter='/') for row in csvreader: csvwriter.write(row) Skip From skip at pobox.com Tue Jan 28 01:03:21 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 18:03:21 -0600 Subject: DSVWizard.py In-Reply-To: References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> <1043696542.25139.3027.camel@software1.logiplex.internal> Message-ID: <15925.51401.651798.820598@montanaro.dyndns.org> Cliff> BTW, where are we planning on hosting this project? 
Under one of Cliff> the existing projects or somewhere else? Dave> If we are trying to get this into Python shouldn't we use Dave> something like sourceforge. Has anyone been through the process Dave> of getting code into Python before? I have checkin privileges on the Python repository. I doubt it will be difficult to get all of you set up similarly. The Python CVS sandbox would then make a logical place to host it: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/ I can just create a "csv" subdirectory there to get us started. Skip From djc at object-craft.com.au Tue Jan 28 01:23:37 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 11:23:37 +1100 Subject: DSVWizard.py In-Reply-To: <15925.51182.467769.765511@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> <15925.51182.467769.765511@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> I'm working on a PEP... ;-) This thread is all good grist for Skip> the mill. I'll try to get something minimal you can throw Skip> tomatoes at tonight or tomorrow. Excellent. Dave> * Define Python API for CSV parser. Dave> * Define extension module API. Skip> I'm not sure you need to define an extension module API. I view Skip> the extension module is essentially an implementation detail. True. Skip> Hmmm... I'd like to get something into 2.3 without a wholesale Skip> rewrite if possible. 
I see two basic operations: Skip> * suck the contents of a file-like object opened for reading Skip> into a list of lists (or iterable returning lists) Skip> * write a list of lists to to a file-like object opened for Skip> writing Skip> I view the rest of the API as essentially just tweaks to the Skip> formatting parameters. Sounds easy :-) Skip> I think Dave's csv module (should I be calling it Object Craft's Skip> csv module? I don't mean to slight other contributors) Call it Object Craft's. I did the initial work but Andrew has his fingerprints all over it now. Skip> import csv Skip> Skip> csvreader = csv.parser(file("nastiness.csv")) Skip> # csvreader.setparams(dialect="excel2000", quote='"', delimiter='/') Skip> Skip> for row in csvreader: Skip> process(row) That is a really nice interface. I like it a lot. Skip> import csv Skip> Skip> csvwriter = csv.writer(file("newnastiness.csv", "w")) Skip> # csvwriter.setparams(dialect="lotus123", quote='"', delimiter='/') Skip> Skip> for row in someiterable: Skip> csvwriter.write(row) Very nice. Skip> The .setparams() method can obviously be collapsed into the Skip> constructors. Skip> Skip> I could thus implement a CSV dialect converter (do others like Skip> "dialect" better than "variant"?) thus: Skip> Skip> import csv Skip> Skip> csvreader = csv.parser(file("nastiness.csv"), dialect="excel2000") Skip> csvwriter = csv.writer(file("newnastiness.csv", "w"), Skip> dialect="lotus123", delimiter='/') Skip> Skip> for row in csvreader: Skip> csvwriter.write(row) This is excellent stuff. I am not very good at naming, but "dialect" looks good to me. 
- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Tue Jan 28 01:43:33 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 11:43:33 +1100 Subject: DSVWizard.py In-Reply-To: <15925.51401.651798.820598@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> <1043696542.25139.3027.camel@software1.logiplex.internal> <15925.51401.651798.820598@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Cliff> BTW, where are we planning on hosting this project? Under one Cliff> of the existing projects or somewhere else? Dave> If we are trying to get this into Python shouldn't we use Dave> something like sourceforge. Has anyone been through the process Dave> of getting code into Python before? Skip> I have checkin privileges on the Python repository. I doubt it Skip> will be difficult to get all of you set up similarly. The Skip> Python CVS sandbox would then make a logical place to host it: Skip> Skip> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/ Skip> I can just create a "csv" subdirectory there to get us started. I like that plan. I would be more than happy to have our code moved into the sandbox with the goal of having this go into Python 2.3. Unless I am missing the point, I assume you plan to have something like the following as a starting point: * A new csv.py Python module which exports the interface defined in the PEP. * Our current CSV parser renamed to something like _csvparser. * The torture test. 
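Dave's proposed layering (a public csv.py wrapping a renamed extension module, in the Pickle/cPickle style Cliff suggested) might look roughly like this. The name `_csvparser` follows Dave's suggestion above; neither module exists yet, so this is only a structural sketch:

```python
# Hypothetical layering sketch: csv.py tries the fast extension module
# and falls back to pure Python when it is unavailable.
try:
    import _csvparser as _engine   # fast C extension (hypothetical name)
except ImportError:
    _engine = None                 # pure-Python fallback path

def parse_line(line, delimiter=','):
    if _engine is not None:
        return _engine.parse(line, delimiter)
    # Minimal pure-Python fallback; real code would handle quoting too.
    return line.split(delimiter)
```

The point of the split is that variant presets, argument checking, and any future tweakables stay in the Python layer, while the extension stays a dumb, fast parser.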
- Dave -- http://www.object-craft.com.au From skip at pobox.com Tue Jan 28 02:57:05 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 19:57:05 -0600 Subject: SF ids please Message-ID: <15925.58225.712028.494438@montanaro.dyndns.org> Please confirm your Sourceforge usernames for me: Dave Cole davecole Cliff Wells cliffwells18 Kevin Altis kasplat I will see about getting you checkin privileges for Python CVS. Dave, what about Andrew? Skip From djc at object-craft.com.au Tue Jan 28 03:06:18 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 13:06:18 +1100 Subject: SF ids please In-Reply-To: <15925.58225.712028.494438@montanaro.dyndns.org> References: <15925.58225.712028.494438@montanaro.dyndns.org> Message-ID: > Please confirm your Sourceforge usernames for me: > Dave Cole davecole That is me. Andrew is getting an account set up now. - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Tue Jan 28 03:12:07 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Tue, 28 Jan 2003 13:12:07 +1100 Subject: SF ids please In-Reply-To: Message from Dave Cole of "28 Jan 2003 13:06:18 +1100." References: <15925.58225.712028.494438@montanaro.dyndns.org> Message-ID: <20030128021207.5D8AA3C1F4@coffee.object-craft.com.au> >> Please confirm your Sourceforge usernames for me: > >Andrew is getting an account set up now.
Done: "andrewmcnamara" -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Tue Jan 28 03:23:43 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 20:23:43 -0600 Subject: sandbox created Message-ID: <15925.59823.804408.159618@montanaro.dyndns.org> I created the sandbox with a handful of stub files. You can browse them at http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv I've also asked the Python admins for checkin privileges for each of you. Skip From skip at pobox.com Tue Jan 28 04:15:32 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 21:15:32 -0600 Subject: Checkin privileges for a few other people please? (fwd) Message-ID: <15925.62932.544759.35012@montanaro.dyndns.org> Hey folks, Guido says it's a go if you're cool with the PSF license. This will likely affect your current code. Let me know, yea or nay. Skip -------------- next part -------------- An embedded message was scrubbed... From: Guido van Rossum Subject: Re: Checkin privileges for a few other people please? Date: Mon, 27 Jan 2003 21:24:55 -0500 Size: 5804 Url: http://mail.python.org/pipermail/csv/attachments/20030127/b0896ab6/attachment.mht From djc at object-craft.com.au Tue Jan 28 04:57:07 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 14:57:07 +1100 Subject: Checkin privileges for a few other people please? (fwd) In-Reply-To: <15925.62932.544759.35012@montanaro.dyndns.org> References: <15925.62932.544759.35012@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> Guido says it's a go if you're cool with the PSF license. This Skip> will likely affect your current code. Let me know, yea or nay. I have skimmed through the psf-contributor-agreement. It looks like we lose nothing by contributing - we just grant PSF equal copyright. That is fine by us (that is a yea). I suppose we should fax some signed copies of the various agreements. 
- Dave Guido> I'd like to make sure that they will assign the copyright to the Guido> PSF. This is especially important since two of these are Guido> already authors of 3rd party code with possibly different Guido> licenses. All new code in the Python CVS *must* be under the Guido> standard PSF license. Guido> If they all agree with the drafts at Guido> http://www.python.org/psf/psf-contributor-agreement.html Guido> it's a deal, as far as I'm concerned. (Oh, and the usual Guido> caution for checking in outside the area for which they are Guido> responsible.) -- http://www.object-craft.com.au From djc at object-craft.com.au Tue Jan 28 05:08:00 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 15:08:00 +1100 Subject: sandbox created In-Reply-To: <15925.59823.804408.159618@montanaro.dyndns.org> References: <15925.59823.804408.159618@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> I created the sandbox with a handful of stub files. You can Skip> browse them at Skip> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv Skip> I've also asked the Python admins for checkin privileges for Skip> each of you. If Skip is prepared to do it, I think he should act as project leader. I think that it is important to have someone who does not have a personal attachment to any existing code. I have my own ideas about how we should proceed. I suspect I am not alone :-) - Dave -- http://www.object-craft.com.au From skip at pobox.com Tue Jan 28 05:20:23 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 22:20:23 -0600 Subject: First Cut at CSV PEP Message-ID: <15926.1287.36487.12649@montanaro.dyndns.org> I'm ready to toddle off to bed, so I'm stopping here for tonight. Attached is what I've come up with so far in the way of a PEP. Feel free to flesh out, rewrite or add new sections. After a brief amount of cycling, I'll check it into CVS. 
Skip -------------- next part -------------- PEP: NNN Title: CSV file API Version: $Revision: $ Last-Modified: $Date: $ Author: Skip Montanaro , Kevin Altis , Cliff Wells Status: Active Type: Draft Content-Type: text/x-rst Created: 26-Jan-2003 Python-Version: 2.3 Post-History: Abstract ======== The Comma Separated Values (CSV) file format is the most common import and export format for spreadsheets and databases. Although many CSV files are simple to parse, the format is not formally defined by a stable specification and is subtle enough that parsing lines of a CSV file with something like ``line.split(",")`` is bound to fail. This PEP defines an API for reading and writing CSV files which should make it possible for programmers to select a CSV module which meets their requirements. Existing Modules ================ Three widely available modules enable programmers to read and write CSV files: - Dave Cole's csv module [1]_ - Cliff Wells's Python-DSV module [2]_ - Laurence Tratt's ASV module [3]_ They have different APIs, making it somewhat difficult for programmers to switch between them. More of a problem may be that they interpret some of the CSV corner cases differently, so even after surmounting the differences in the module APIs, the programmer has to also deal with semantic differences between the packages. Rationale ========= By defining common APIs for reading and writing CSV files, we make it easier for programmers to choose an appropriate module to suit their needs, and make it easier to switch between modules if their needs change. This PEP also forms a set of requirements for creation of a module which will hopefully be incorporated into the Python distribution. Module Interface ================ The module supports two basic APIs, one for reading and one for writing. 
The reading interface is:: reader(fileobj [, dialect='excel2000'] [, quotechar='"'] [, delimiter=','] [, skipinitialspace=False]) A reader object is an iterable which takes a file-like object opened for reading as the sole required parameter. It also accepts four optional parameters (discussed below). Readers are typically used as follows:: csvreader = csv.parser(file("some.csv")) for row in csvreader: process(row) The writing interface is similar:: writer(fileobj [, dialect='excel2000'] [, quotechar='"'] [, delimiter=','] [, skipinitialspace=False]) A writer object is a wrapper around a file-like object opened for writing. It accepts the same four optional parameters as the reader constructor. Writers are typically used as follows:: csvwriter = csv.writer(file("some.csv", "w")) for row in someiterable: csvwriter.write(row) Optional Parameters ------------------- Both the reader and writer constructors take four optional keyword parameters: - dialect is an easy way of specifying a complete set of format constraints for a reader or writer. Most people will know what application generated a CSV file or what application will process the CSV file they are generating, but not the precise settings necessary. The only dialect defined initially is "excel2000". The dialect parameter is interpreted in a case-insensitive manner. - quotechar specifies a one-character string to use as the quoting character. It defaults to '"'. - delimiter specifies a one-character string to use as the field separator. It defaults to ','. - skipinitialspace specifies how to interpret whitespace which immediately follows a delimiter. It defaults to False, which means that whitespace immediately following a delimiter is part of the following field. When processing a dialect setting and one or more of the other optional parameters, the dialect parameter is processed first, then the others are processed. This makes it easy to choose a dialect, then override one or more of the settings.
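The skipinitialspace behaviour described above can be sketched with a simplified splitter that ignores quoting entirely. The parameter name comes from the draft; the function itself is illustrative, not the proposed reader:

```python
# Illustrative only: skipinitialspace says whitespace immediately after
# a delimiter is not part of the field data.
def split_simple(line, delimiter=',', skipinitialspace=False):
    fields = line.split(delimiter)
    if skipinitialspace:
        # Strip leading spaces from every field that follows a delimiter.
        fields = fields[:1] + [f.lstrip(' ') for f in fields[1:]]
    return fields

split_simple('a, b, c')                         # ['a', ' b', ' c']
split_simple('a, b, c', skipinitialspace=True)  # ['a', 'b', 'c']
```

This is exactly the DSV-versus-csv difference seen with nastiness.csv: DSV behaves as if skipinitialspace were True, the Object Craft csv module as if it were False.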
For example, if a CSV file was generated by Excel 2000 using single quotes as the quote character, you could create a reader like:: csvreader = csv.parser(file("some.csv"), dialect="excel2000", quotechar="'") Testing ======= TBD. Issues ====== - Should a parameter control how consecutive delimiters are interpreted? (My thought is "no".) References ========== .. [1] csv module, Object Craft (http://www.object-craft.com.au/projects/csv) .. [2] Python-DSV module, Wells (http://sourceforge.net/projects/python-dsv/) .. [3] ASV module, Tratt (http://tratt.net/laurie/python/asv/) Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 End: From djc at object-craft.com.au Tue Jan 28 05:56:39 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 15:56:39 +1100 Subject: First Cut at CSV PEP In-Reply-To: <15926.1287.36487.12649@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> I'm ready to toddle off to bed, so I'm stopping here for Skip> tonight. Attached is what I've come up with so far in the way Skip> of a PEP. Feel free to flesh out, rewrite or add new sections. Skip> After a brief amount of cycling, I'll check it into CVS. I only have one issue with the PEP as it stands. It is still aiming too low. One of the things that we support in our parser is the ability to handle CSV without quote characters. field1,field2,field3\, field3,field4 One of our customers has data like the above. To handle this we would need something like the following: # Use the 'raw' dialect to get access to all tweakables. writer(fileobj, dialect='raw', quotechar=None, delimiter=',', escapechar='\\') I think that we need some way to handle a potentially different set of options on each dialect. 
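Dave's escapechar idea (quoting disabled, the delimiter escaped by a backslash) could behave like this toy splitter. It is an illustration of the desired semantics, not the csv module's actual parser:

```python
# Toy escape-aware splitter (hypothetical): quotechar=None, and the
# escapechar makes the following character literal.
def split_escaped(line, delimiter=',', escapechar='\\'):
    fields, field, i = [], '', 0
    while i < len(line):
        c = line[i]
        if c == escapechar and i + 1 < len(line):
            field += line[i + 1]   # take the next character literally
            i += 2
        elif c == delimiter:
            fields.append(field)
            field = ''
            i += 1
        else:
            field += c
            i += 1
    fields.append(field)
    return fields

split_escaped(r'field1,field2,field3\, field3,field4')
# -> ['field1', 'field2', 'field3, field3', 'field4']
```

A "raw" dialect exposing all such tweakables, with named dialects supplying safe defaults, would cover data like this without complicating the common Excel case.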
When you CSV export from Excel, do you have the ability to use a delimiter other than comma? Do you have the ability to change the quotechar? Should the wrapper protect you from yourself so that when you select the Excel dialect you are limited to the options available within Excel? Maybe the dialect should not limit you, it should just provide the correct defaults. Since we are going to have one parsing engine in an extension module below the Python layer, we are probably going to evolve more tweakable settings in the parser over time. It would be nice if we could hide new tweakables from application code by associating defaults values with dialect names in the Python layer. We should not be exposing the low level parser interface to user code if it can be avoided. - Dave -- http://www.object-craft.com.au From altis at semi-retired.com Tue Jan 28 06:50:20 2003 From: altis at semi-retired.com (Kevin Altis) Date: Mon, 27 Jan 2003 21:50:20 -0800 Subject: First Cut at CSV PEP In-Reply-To: Message-ID: > From: Dave Cole > > >>>>> "Skip" == Skip Montanaro writes: > > I only have one issue with the PEP as it stands. It is still aiming > too low. One of the things that we support in our parser is the > ability to handle CSV without quote characters. > > field1,field2,field3\, field3,field4 Excel certainly can't handle that, nor do I think Access can. If a field contains a comma, then the field must be quoted. Now, that isn't to say that we shouldn't be able to support the idea of escaped characters, but when exporting if you do want something that a tool like Excel could read, you would need to generate an exception if quoting wasn't specified. The same would probably apply for embedded newlines in a field without quoting. Being able to generate exceptions on import and export operations could be one of the big benefits of this module. 
You won't accidentally export something that someone on the other end won't be able to use and you'll know on import that you have garbage before you try and use it. For example, when I first started trying to import Access data that was tab-separated, I didn't realize there were embedded newlines until much later, at which point I was able to go back and export as CSV with quote delimiters and the data became usable. > I think that we need some way to handle a potentially different set of > options on each dialect. I'm not real comfortable with the dialect idea, it doesn't seem to add any value over simply specifying a separator and delimiter. We aren't dealing with encodings, so anything other than 7-bit ASCII unless specified as a delimiter or separator would be undefined, yes? The only thing that really matters is the delimiter and separator and then how quoting of either of those characters and of embedded returns and newlines within a field is handled. Correct me if I'm wrong, but I don't think the MS CSV formats can deal with embedded CR or LF unless fields are quoted and that will be done with a " character. Now with Access, you are actually given more control. See the attached screenshot. Ignoring everything except the top File format section you have: Delimited or Fixed Width. If Delimited you have a Field Delimiter choice of comma, semi-colon, tab and space or a user-specified character and the text qualifier can be double-quote, apostrophe, or None. > When you CSV export from Excel, do you have the ability to use a > delimiter other than comma? Do you have the ability to change the > quotechar? No, but there are a variety of text formats supported.
The Excel 2000 help file for Text file formats: "Text (Tab-delimited) (*.txt) (Windows) Text (Macintosh) Text (OS/2 or MS-DOS) CSV (comma delimited) (*.csv) (Windows) CSV (Macintosh) CSV (OS/2 or MS-DOS) If you are saving a workbook as a tab-delimited or comma-delimited text file for use on another operating system, select the appropriate converter to ensure that tab characters, line breaks, and other characters are interpreted correctly." The Excel 2000 help file for CSV: "CSV (Comma delimited) format The CSV (Comma delimited) file format saves only the text and values as they are displayed in cells of the active worksheet. All rows and all characters in each cell are saved. Columns of data are separated by commas, and each row of data ends in a carriage return. If a cell contains a comma, the cell contents are enclosed in double quotation marks. If cells display formulas instead of formula values, the formulas are converted as text. All formatting, graphics, objects, and other worksheet contents are lost. Note If your workbook contains special font characters such as a copyright symbol (C), and you will be using the converted text file on a computer with a different operating system, save the workbook in the text file format appropriate for that system. For example, if you are using Windows and want to use the text file on a Macintosh computer, save the file in the CSV (Macintosh) format. If you are using a Macintosh computer and want to use the text file on a system running Windows or Windows NT, save the file in the CSV (Windows) format." The CR, CR/LF, and LF line endings probably have something to do with saving in Mac format, but it may also do some 8-bit character translation. The universal readlines support in Python 2.3 may impact the use of a file reader/writer when processing different text files, but would returns or newlines within a field be impacted? 
Should the PEP and API specify that the record delimiter can be either CR, LF, or CR/LF, but use of those characters inside a field requires the field to be quoted or an exception will be thrown? ka -------------- next part -------------- A non-text attachment was scrubbed... Name: access_export.png Type: image/png Size: 9504 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030127/7594f034/attachment.png From altis at semi-retired.com Tue Jan 28 07:39:28 2003 From: altis at semi-retired.com (Kevin Altis) Date: Mon, 27 Jan 2003 22:39:28 -0800 Subject: various CVS references In-Reply-To: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: Just for reference some Google searches of "cvs spec" "comma separated values" and some other variants produced Java http://ostermiller.org/utils/CSVLexer.html Perl http://rath.ca/Misc/Perl_CSV/ http://rath.ca/Misc/Perl_CSV/CSV-2.0.html#csv%20specification A search on CPAN for csv yields a lot of different modules, some with test data. http://theoryx5.uwinnipeg.ca/mod_perl/cpan-search?request=search The TCL standard libs (whatever those are ;-) has a module http://tcllib.sourceforge.net/doc/csv.html MSDN references http://msdn.microsoft.com/library/default.asp?url=/library/en-us/netdir/ad/comma-separated_value_csv_scripts.asp There are a variety of other things on MSDN, none of which seem particularly helpful. Apparently, the MS Commerce server actually contains ImportCSV and ExportCSV methods. I'm still searching to see if I can find further MS qualifications of CSV and/or tab-delimited formats as supported by various tools. ka From altis at semi-retired.com Tue Jan 28 07:43:22 2003 From: altis at semi-retired.com (Kevin Altis) Date: Mon, 27 Jan 2003 22:43:22 -0800 Subject: First Cut at CSV PEP In-Reply-To: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: > I'm ready to toddle off to bed, so I'm stopping here for tonight.
> Attached > is what I've come up with so far in the way of a PEP. Feel free to flesh > out, rewrite or add new sections. After a brief amount of cycling, I'll > check it into CVS. Probably need to specify that input and output deals with string representations, but there are some differences: [[5,'Bob',None,1.0]] DSV.exportCSV produces '5,Bob,None,1.0' Data that doesn't need quoting isn't quoted. Assuming those were spreadsheet values with the third item just an empty cell, then using Excel export rules would result in a default CSV of 5,Bob,,1\r\n None is just an empty field. In Excel, the number 1.0 is just 1 in the exported file, but that may not matter, we can export 1.0 for the field. This reminds me that the boundary case of the last record just having EOF with no line ending should be tested. Importing this line with importDSV for example yields a list of lists. [['5', 'Bob', '', '1']] It's debatable whether the third field should be None or an empty string. Further interpretation of each field becomes application-specific. The API makes it easy to do further processing as each row is read. I'm still not sure about some of the database CSV handling issues, often it seems they want a string field to be quoted regardless of whether it contains a comma or newlines, but number and empty field should not be quoted. It is certainly nice to be able to import a file that contains 5,"Bob",,1.0\r\n and not need to do any further translation. Excel appears to interpret quoted numbers and unquoted numbers as numeric fields when importing.
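The Excel-style export rules described above fit in a few lines of Python. This is only an illustration of those rules (the function names are made up, not the proposed module API); note that, unlike Excel, it writes 1.0 as "1.0" rather than "1":

```python
def excel_field(value, delimiter=",", quotechar='"'):
    """Render one value Excel-style: None becomes an empty field, and a
    field is quoted only if it contains the delimiter, the quote
    character, or a line break; embedded quotes are doubled."""
    if value is None:
        return ""
    text = str(value)
    if any(c in text for c in (delimiter, quotechar, "\r", "\n")):
        return quotechar + text.replace(quotechar, quotechar * 2) + quotechar
    return text

def excel_record(row):
    """Join the rendered fields and terminate the record with CRLF."""
    return ",".join(excel_field(value) for value in row) + "\r\n"

print(repr(excel_record([5, "Bob", None, 1.0])))  # '5,Bob,,1.0\r\n'
```

Reading the record back gives strings only (['5', 'Bob', '', '1.0']), so the None/empty-string question and any number conversion are left to the application, as discussed above.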
Just trying to be anal-retentive here to make sure all the issues are covered ;-) ka From altis at semi-retired.com Tue Jan 28 16:20:21 2003 From: altis at semi-retired.com (Kevin Altis) Date: Tue, 28 Jan 2003 07:20:21 -0800 Subject: First Cut at CSV PEP In-Reply-To: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: The big issue with the MS/Excel CSV format is that MS doesn't appear to escape any characters or support import of escaped characters. A field that contains characters that you might normally escape (including a comma if that is the separator) are instead enclosed in double quotes by default and then any double quotes in the field are doubled. I found this MySQL article where the dialogs show the emphasis on escape characters. http://www.databasejournal.com/features/mysql/article.php/10897_1558731_5 It doesn't seem like you would run into a case where a file would use the MS CSV format and have escaped characters too, but perhaps these exist in the wild. On the export, I think you would want the option of specifying whether to use field qualifiers (quotes) on all fields and then only optionally enclose a field if qualifiers are needed. If you aren't generating MS CSV format and are using escape sequences, the field "quotes" aren't needed. See the Export Data as CSV dialog at the URL above. I guess MySQL could be one of the dialects and that would be closer to what everyone expects except MS? Ugh, I shouldn't try and think about this stuff before morning coffee ;-) ka > -----Original Message----- > From: Skip Montanaro [mailto:skip at pobox.com] > Sent: Monday, January 27, 2003 8:20 PM > To: LogiplexSoftware at earthlink.net; altis at semi-retired.com; > csv at object-craft.com.au > Subject: First Cut at CSV PEP > > > > I'm ready to toddle off to bed, so I'm stopping here for tonight. > Attached > is what I've come up with so far in the way of a PEP. Feel free to flesh > out, rewrite or add new sections. 
After a brief amount of cycling, I'll > check it into CVS. > > Skip > > From altis at semi-retired.com Tue Jan 28 16:50:53 2003 From: altis at semi-retired.com (Kevin Altis) Date: Tue, 28 Jan 2003 07:50:53 -0800 Subject: more Perl CSV - http://tit.irk.ru/perlbookshelf/cookbook/ch01_16.htm In-Reply-To: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: From skip at pobox.com Tue Jan 28 17:56:26 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 10:56:26 -0600 Subject: various CVS references In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: <15926.46650.579273.539803@montanaro.dyndns.org> Kevin> Just for reference some Google searches of "cvs spec" "comma Kevin> separated values" and some other variants produced Much appreciated. I will incorporate some of them into the PEP. Kevin> Java Kevin> http://ostermiller.org/utils/CSVLexer.html Interestingly enough, the author considers Excel's format not conformant with "the generally accepted standards" and requires the programmer to use special Excel readers and writers. I wonder who he's been talking to about standards. ;-) Kevin> Perl Kevin> http://rath.ca/Misc/Perl_CSV/ Kevin> http://rath.ca/Misc/Perl_CSV/CSV-2.0.html#csv%20specification I like that this guy has a BNF diagram for CSV files. He treats delimiters and quote characters as static, which we would probably make dynamic. Perhaps I can come up with something similar for the PEP. Kind of a Gory Details appendix. Kevin> A search on CPAN for csv yields a lot of different modules, some Kevin> with test data. Kevin> http://theoryx5.uwinnipeg.ca/mod_perl/cpan-search?request=search CPAN is great if you know what you're looking for but is a morass otherwise. It gives you lots of choices, but not enough information to decide which packages are high quality. The Vaults of Parnassus has the same problem, but fewer choices. 
Kevin> The TCL standard libs (whatever those are ;-) has a module Kevin> http://tcllib.sourceforge.net/doc/csv.html Looks a bit low level. Kevin> MSDN references Kevin> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/netdir/ad/comma-separated_value_csv_scripts.asp Doesn't look all that useful. Kevin> http://tit.irk.ru/perlbookshelf/cookbook/ch01_16.htm Interesting cookbook recipe, but nothing Dave and Cliff don't already know how to do. ;-) Besides, it uses regular expressions to parse fields. As Jamie Zawinski says: Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. Skip From altis at semi-retired.com Tue Jan 28 18:02:06 2003 From: altis at semi-retired.com (Kevin Altis) Date: Tue, 28 Jan 2003 09:02:06 -0800 Subject: various CVS references In-Reply-To: <15926.46650.579273.539803@montanaro.dyndns.org> Message-ID: > From: Skip Montanaro [mailto:skip at pobox.com] > > Kevin> Just for reference some Google searches of "cvs spec" "comma > Kevin> separated values" and some other variants produced > > Much appreciated. I will incorporate some of them into the PEP. All this was just for reference sake so we have a better idea of current practice in other languages. I have an email out to a .NET guru friend just to see if MS has documented any better CSV as it relates to .NET methods in various products. I think we already understand the problem domain better than most and realize that handling the MS format for both import and export out of the gate is crucial for a standard lib. 
ka From LogiplexSoftware at earthlink.net Tue Jan 28 22:17:32 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 13:17:32 -0800 Subject: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: <1043788652.25139.3222.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 20:56, Dave Cole wrote: > I only have one issue with the PEP as it stands. It is still aiming > too low. One of the things that we support in our parser is the > ability to handle CSV without quote characters. > > field1,field2,field3\, field3,field4 > > One of our customers has data like the above. To handle this we would > need something like the following: > > # Use the 'raw' dialect to get access to all tweakables. > writer(fileobj, > dialect='raw', quotechar=None, delimiter=',', escapechar='\\') +1 on escapechar, -1 on 'raw' dialect. Why would a 'raw' dialect be needed? It isn't clear to me why escapechar would be mutually exclusive with any particular dialect. Further, not specifying a dialect (dialect=None) should be the default which would seem the same as 'raw'. > I think that we need some way to handle a potentially different set of > options on each dialect. I'm not understanding how this is different from Skip's suggestion to use reader(fileobj, dialect="excel2000", delimiter='\t') Or are you suggesting that not all options would be available on all dialects? Can you suggest an example? > When you CSV export from Excel, do you have the ability to use a > delimiter other than comma? Do you have the ability to change the > quotechar? I think it is an option to save as a TSV file (IIRC), which is the same as a CSV file, but with tabs. > Should the wrapper protect you from yourself so that when you select > the Excel dialect you are limited to the options available within > Excel? No. I think this would be unnecessarily limiting. > Maybe the dialect should not limit you, it should just provide the > correct defaults. 
This is what I'm thinking. > Since we are going to have one parsing engine in an extension module > below the Python layer, we are probably going to evolve more tweakable > settings in the parser over time. It would be nice if we could hide > new tweakables from application code by associating defaults values > with dialect names in the Python layer. We should not be exposing the > low level parser interface to user code if it can be avoided. +1 -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Tue Jan 28 22:25:17 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 13:25:17 -0800 Subject: Checkin privileges for a few other people please? (fwd) In-Reply-To: <15925.62932.544759.35012@montanaro.dyndns.org> References: <15925.62932.544759.35012@montanaro.dyndns.org> Message-ID: <1043789116.25146.3230.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 19:15, Skip Montanaro wrote: > Hey folks, > > Guido says it's a go if you're cool with the PSF license. This will likely > affect your current code. Let me know, yea or nay. > > Skip > DSV is already listed under the Python license on SF, and even if it weren't, I'd have no problem with this. > > ______________________________________________________________________ > > From: Guido van Rossum > To: skip at pobox.com > Cc: Barry Warsaw , Fred Drake , Jeremy Hylton , Tim Peters > Subject: Re: Checkin privileges for a few other people please? > Date: 27 Jan 2003 21:24:55 -0500 > > > I'm writing to see if you can give four people Python checkin privileges: > > > > who SF username > > --- ----------- > > Kevin Altis kasplat > > Dave Cole davecole > > Andrew McNamara andrewmcnamara > > Cliff Wells cliffwells18 > > > > We are launching on a PEP and a module to support reading and writing CSV > > files. 
Dave Cole, Andrew McNamara and Cliff Wells are authors of currently > > available CSV packages (csv and Python-DSV - see Parnassus for pointers). > > Kevin Altis is the author of PythonCard, and a user of CSV formats. (I also > > use CSV files a lot.) All four have contributed substantially to the Python > > community. > > > > We're currently working on a PEP to define the API. The current plan is to > > build heavily on the Object Craft (Dave and Andrew) and Cliff's modules with > > a more Pythonic API than either currently has. I created a directory in the > > sandbox just now to support this little mini-project. The goal is to have > > something which can be included in Python 2.3, though this may be a bit > > optimistic, even with a substantial body of code already written. > > I'd like to make sure that they will assign the copyright to the PSF. > This is especially important since two of these are already authors of > 3rd party code with possibly different licenses. All new code in the > Python CVS *must* be under the standard PSF license. > > If they all agree with the drafts at > > http://www.python.org/psf/psf-contributor-agreement.html > > it's a deal, as far as I'm concerned. (Oh, and the usual caution for > checking in outside the area for which they are responsible.) 
> > --Guido van Rossum (home page: http://www.python.org/~guido/) -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Tue Jan 28 22:26:28 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 13:26:28 -0800 Subject: SF ids please In-Reply-To: <15925.58225.712028.494438@montanaro.dyndns.org> References: <15925.58225.712028.494438@montanaro.dyndns.org> Message-ID: <1043789188.25146.3232.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 17:57, Skip Montanaro wrote: > Please confirm your Sourceforge usernames for me: > > Dave Cole davecole > Cliff Wells cliffwells18 > Kevin Altis kasplat > > I will see about getting you checkin privileges for Python CVS. Dave, what > about Andrew? cliffwells18 confirmed =) -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Tue Jan 28 22:45:21 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 13:45:21 -0800 Subject: First Cut at CSV PEP In-Reply-To: References: Message-ID: <1043790321.25139.3251.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 21:50, Kevin Altis wrote: > > From: Dave Cole > > > > >>>>> "Skip" == Skip Montanaro writes: > > > > I only have one issue with the PEP as it stands. It is still aiming > > too low. One of the things that we support in our parser is the > > ability to handle CSV without quote characters. > > > > field1,field2,field3\, field3,field4 > > Excel certainly can't handle that, nor do I think Access can. If a field > contains a comma, then the field must be quoted. Now, that isn't to say that > we shouldn't be able to support the idea of escaped characters, but when > exporting if you do want something that a tool like Excel could read, you > would need to generate an exception if quoting wasn't specified. 
The same > would probably apply for embedded newlines in a field without quoting. > > Being able to generate exceptions on import and export operations could be > one of the big benefits of this module. You won't accidentally export > something that someone on the other end won't be able to use and you'll know > on import that you have garbage before you try and use it. For example, when > I first started trying to import Access data that was tab-separated, I > didn't realize there were embedded newlines until much later, at which point > I was able to go back and export as CSV with quote delimiters and the data > became usable. Perhaps a "strict" option? I'm not sure this is necessary though. It seems that if a *programmer* specifies dialect="excel2000" and then changes some other default, that's his problem. There's a danger that too much hand-holding leads to added complexity and arbitrary limitations. > > I think that we need some way to handle a potentially different set of > options on each dialect. > > I'm not real comfortable with the dialect idea, it doesn't seem to add any > value over simply specifying a separator and delimiter. Except that it gives a programmer a way to be certain that, if he does nothing else, the file will be compatible with the specified dialect. > We aren't dealing with encodings, so anything other than 7-bit ASCII unless > specified as a delimiter or separator would be undefined, yes? The only > thing that really matters is the delimiter and separator and then how > quoting is handled of either of those characters and embedded returns and > newlines within a field. Correct me if I'm wrong, but I don't think the MS > CSV formats can deal with embedded CR or LF unless fields are quoted and > that will be done with a " character. But then MS isn't the only potential target, just our initial (and primary) target.
foobar87 may allow export of escaped newlines and put an extraneous space after every delimiter and we don't want someone to have to write another csv importer to deal with it. > Now with Access, you are actually given more control. See the attached > screenshot. Ignoring everything except the top File format section you > have: > Delimited or Fixed Width. If Delimited you have a Field Delimiter choice of > comma, semi-colon, tab and space or a user-specified character and the text > qualifier can be double-quote, apostrophe, or None. And this only deals with the variations the *user* is allowed to make. Applications themselves may introduce variations that we need to have the flexibility to deal with. > The universal readlines support in Python 2.3 may impact the use of a file > reader/writer when processing different text files, but would returns or > newlines within a field be impacted? Should the PEP and API specify that the > record delimiter can be either CR, LF, or CR/LF, but use of those characters > inside a field requires the field to be quoted or an exception will be > thrown? The idea of raising an exception brings up an interesting problem that I had to deal with in DSV. I've run across files that were missing fields and just had a callback so the programmer could decide how to deal with it. This can be the result of corrupted data, but it's also possible for an application to only export fields that actually contain data, for instance: 1,2,3,4,5 1,2,3 1,2,3,4 This could very well be a valid csv file. I'm not aware of any requirement that rows all be the same length. We'll need to have some fairly flexible error-handling to allow for this type of thing when required or raise an exception when it indicates corrupt/invalid data. In DSV I allowed custom error-handlers so the programmer could indicate whether to process the line as normal, discard it, etc.
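A DSV-style error handler for ragged rows might look roughly like the following sketch. The names and the callback signature here are hypothetical, not DSV's actual API; the point is that the caller decides whether a short row is repaired, discarded, or fatal:

```python
def filter_rows(rows, expected_len, on_bad_row=None):
    """Yield rows, passing any row whose length differs from
    expected_len to a callback.  The callback may return a repaired
    row, or None to discard it; with no callback, ragged rows raise."""
    for row in rows:
        if len(row) != expected_len:
            if on_bad_row is None:
                raise ValueError(
                    "expected %d fields, got %d" % (expected_len, len(row)))
            row = on_bad_row(row)
            if row is None:
                continue
        yield row

def pad(row, width=5, filler=""):
    """One possible repair: pad short rows with empty fields."""
    return row + [filler] * (width - len(row))

rows = [["1", "2", "3", "4", "5"], ["1", "2", "3"], ["1", "2", "3", "4"]]
print(list(filter_rows(rows, 5, pad)))
```

Run on the three example records above, the pad handler turns them all into five-field rows; dropping the handler instead raises on the first three-field row.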
> ka -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Tue Jan 28 22:55:12 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 15:55:12 -0600 Subject: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: <15926.64576.481489.373053@montanaro.dyndns.org> Kevin> Probably need to specify that input and output deals with string Kevin> representations, but there are some differences: Kevin> [[5,'Bob',None,1.0]] Kevin> DSV.exportCSV produces Kevin> '5,Bob,None,1.0' I'm not so sure this mapping None to "None" on output is such a good idea because it's not reversible in all situations and hurts portability to other systems (e.g., does Excel have a concept of None? what happens if you have a text field which just happens to contain "None"?). I think we need to limit the data which can be output to strings, Unicode strings (if we use an encoded stream), floats and ints. Anything else should raise TypeError. Kevin> I'm still not sure about some of the database CSV handling Kevin> issues, often it seems they want a string field to be quoted Kevin> regardless of whether it contains a comma or newlines, but number Kevin> and empty field should not be quoted. It is certainly nice to be Kevin> able to import a file that contains Kevin> 5,"Bob",,1.0\r\n Kevin> and not need to do any further translation. Excel appears to Kevin> interpret quoted numbers and unquoted numbers as numeric fields Kevin> when importing. I like my CSV files to be fully quoted (even fields which may contain numbers), largely because it makes later (dangerous) matching using regular expressions simpler. Otherwise I wind up having to make all the quotes in the regular expressions optional. It just complicates things. Kevin> Just trying to be anal-retentive here to make sure all the issues Kevin> are covered ;-) I hear ya. 
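The reversibility problem with writing None as the string "None" is easy to demonstrate (a plain-Python illustration, not any module's behavior):

```python
# Writing None as the literal string "None" is not reversible.
row = [5, "Bob", None, 1.0]
exported = ",".join(str(value) for value in row)
print(exported)  # 5,Bob,None,1.0

# On re-import every field is a string, and the third field is
# indistinguishable from a cell that really contained the text "None".
reimported = exported.split(",")
assert reimported[2] == "None"

row_with_text = [5, "Bob", "None", 1.0]
assert ",".join(str(v) for v in row_with_text) == exported  # identical
```

Writing None as an empty field (or raising TypeError for unsupported types) avoids the ambiguity entirely.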
I just did a little fiddling in Excel 2000 with some simple values. When I save as CSV, it doesn't give me the option to change the delimiter or quote character. Nor could I figure out how to embed a newline in a cell. It certainly doesn't seem as flexible as Gnumeric in this regard. Can someone provide me with some hints? Attached is a slight modification of the proto-PEP. Really all that's changed is the list of issues has grown. Thx, Skip -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/octet-stream Size: 7138 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030128/ce8a1d53/attachment.obj From skip at pobox.com Tue Jan 28 23:02:37 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 16:02:37 -0600 Subject: First Cut at CSV PEP In-Reply-To: References: Message-ID: <15926.65021.926324.438352@montanaro.dyndns.org> Kevin> I'm not real comfortable with the dialect idea, it doesn't seem Kevin> to add any value over simply specifying a separator and Kevin> delimiter. I look at it as a simple way to specify a group of characteristics specific to the way a vendor reads and writes CSV files. It frees the programmer from having to know all the characteristics of their chosen vendor's file format. Think of it as the difference between Larry Wall's configure script for Perl and the GNU configure script. When I configure Perl I have to know enough about my system to know the alignment boundary of malloc, whether the system is big- or little-endian, etc, even though I know damn well it can figure that stuff out reliably. GNU configure almost never prompts you. It reliably figures out all the low-level stuff for you. 
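The dialect-as-preconfigured-defaults idea could be modeled as a plain mapping in the Python layer, with keyword arguments overriding the named dialect's settings. This is only a sketch; the dialect names and option names are placeholders, not the PEP's actual API:

```python
DIALECT_DEFAULTS = {
    "excel": {"delimiter": ",", "quotechar": '"', "escapechar": None},
    "excel-tsv": {"delimiter": "\t", "quotechar": '"', "escapechar": None},
    "raw": {"delimiter": ",", "quotechar": None, "escapechar": "\\"},
}

def resolve_settings(dialect="raw", **overrides):
    """Start from the named dialect's defaults, then apply explicit
    overrides.  A new low-level tweakable only needs a default added
    to each dialect entry; callers that just name a dialect never see it."""
    if dialect not in DIALECT_DEFAULTS:
        raise ValueError("unknown dialect: %r" % (dialect,))
    settings = dict(DIALECT_DEFAULTS[dialect])
    settings.update(overrides)
    return settings

print(resolve_settings("excel", delimiter="\t"))
```

This is the GNU-configure spirit above: the programmer names the vendor, the table supplies the low-level details, and overriding one option does not require knowing the rest.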
Skip From LogiplexSoftware at earthlink.net Tue Jan 28 23:14:04 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 14:14:04 -0800 Subject: First Cut at CSV PEP In-Reply-To: References: Message-ID: <1043792044.14244.3280.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 22:43, Kevin Altis wrote: > > I'm ready to toddle off to bed, so I'm stopping here for tonight. > > Attached > > is what I've come up with so far in the way of a PEP. Feel free to flesh > > out, rewrite or add new sections. After a brief amount of cycling, I'll > > check it into CVS. > > Probably need to specify that input and output deals with string > representations, but there are some differences: > > [[5,'Bob',None,1.0]] > > DSV.exportCSV produces > > '5,Bob,None,1.0' Hm, that would be a bug in DSV =). The None should not have been exported (it doesn't have any meaning outside of Python). However, only quoting when necessary was lifted straight from Excel. DSV also allows a "quoteAll" option on export to change this behavior. > Data that doesn't need quoting isn't quoted. Assuming those were spreadsheet > values with the third item just an empty cell, then using Excel export rules > would result in a default CSV of > > 5,Bob,,1\r\n This is the correct behavior. > None is just an empty field. In Excel, the number 1.0 is just 1 in the > exported file, but that may not matter, we can export 1.0 for the field. > This reminds me that the boundary case of the last record just having EOF > with no line ending should be tested. Is this not handled correctly by all the existing implementations? > Importing this line with importDSV for example yields a list of lists. > > [['5', 'Bob', '', '1']] > > It's debatable whether the third field should be None or an empty string. > Further interpretation of each field becomes application-specific. The API > makes it easy to do further processing as each row is read.
It's also debatable whether the numbers should have been returned as strings or numbers. I lean towards the former, as csv is a text format and can't convey this sort of information by itself, which is why I chose to return only strings, including the empty string for an empty field rather than None. I agree with Kevin that this is best left to application logic rather than the module. > I'm still not sure about some of the database CSV handling issues, often it > seems they want a string field to be quoted regardless of whether it > contains a comma or newlines, but number and empty field should not be > quoted. It is certainly nice to be able to import a file that contains > 5,"Bob",,1.0\r\n > > and not need to do any further translation. Excel appears to interpret > quoted numbers and unquoted numbers as numeric fields when importing. It treats them as if the user had typed them into a cell, which is not necessarily the behavior we want. To get a number as a string in Excel, I imagine you'd have to have the following: """5""","Bob",,1.0\r\n > > Just trying to be anal-retentive here to make sure all the issues are > covered ;-) And I thought it came naturally =) > ka -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Tue Jan 28 23:21:29 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 14:21:29 -0800 Subject: First Cut at CSV PEP In-Reply-To: <15926.64576.481489.373053@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> Message-ID: <1043792488.25146.3288.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 13:55, Skip Montanaro wrote: > Kevin> Probably need to specify that input and output deals with string > Kevin> representations, but there are some differences: > > Kevin> [[5,'Bob',None,1.0]] > > Kevin> DSV.exportCSV produces > > Kevin> '5,Bob,None,1.0' 
> > I'm not so sure this mapping None to "None" on output is such a good idea Not unless bugs are good ideas ;) Apparently the export stuff in DSV isn't widely used, as this went unnoticed. It is incorrect behavior. > because it's not reversible in all situations and hurts portability to other > systems (e.g., does Excel have a concept of None? what happens if you have a > text field which just happens to contain "None"?). I think we need to limit > the data which can be output to strings, Unicode strings (if we use an > encoded stream), floats and ints. Anything else should raise TypeError. Or be converted to a reasonable string alternative, i.e. None -> '' > Kevin> I'm still not sure about some of the database CSV handling > Kevin> issues, often it seems they want a string field to be quoted > Kevin> regardless of whether it contains a comma or newlines, but number > Kevin> and empty field should not be quoted. It is certainly nice to be > Kevin> able to import a file that contains > > Kevin> 5,"Bob",,1.0\r\n > > Kevin> and not need to do any further translation. Excel appears to > Kevin> interpret quoted numbers and unquoted numbers as numeric fields > Kevin> when importing. > > I like my CSV files to be fully quoted (even fields which may contain > numbers), largely because it makes later (dangerous) matching using regular > expressions simpler. Otherwise I wind up having to make all the quotes in > the regular expressions optional. It just complicates things. Excel only quotes when necessary during export. However, it doesn't care on import which style is used. Allowing the programmer to specify the style in this regard would be a good thing. > Kevin> Just trying to be anal-retentive here to make sure all the issues > Kevin> are covered ;-) > > I hear ya. > > I just did a little fiddling in Excel 2000 with some simple values. When I > save as CSV, it doesn't give me the option to change the delimiter or quote > character.
Nor could I figure out how to embed a newline in a cell. It > certainly doesn't seem as flexible as Gnumeric in this regard. Can someone > provide me with some hints? Don't save as CSV, save as TSV, which is the same, but with tabs rather than commas. I don't know that it allows specifying the quote character. IIRC, you can embed a newline in a cell by entering " in a cell to mark it as a string value, then I think you can just hit enter (or perhaps ctrl+enter). -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Tue Jan 28 23:48:28 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 16:48:28 -0600 Subject: First Cut at CSV PEP In-Reply-To: <1043790321.25139.3251.camel@software1.logiplex.internal> References: <1043790321.25139.3251.camel@software1.logiplex.internal> Message-ID: <15927.2236.560883.798099@montanaro.dyndns.org> Cliff> The idea of raising an exception brings up an interesting problem Cliff> that I had to deal with in DSV. I've run across files that were Cliff> missing fields and just had a callback so the programmer could Cliff> decide how to deal with it. This can be the result of corrupted Cliff> data, but it's also possible for an application to only export Cliff> fields that actually contain data, for instance:

Cliff>     1,2,3,4,5
Cliff>     1,2,3
Cliff>     1,2,3,4

Cliff> This could very well be a valid csv file. I'm not aware of any Cliff> requirement that rows all be the same length. In fact, I think Excel itself will generate such files. As I write this, XEmacs on the Windows machine is displaying a CSV file I dumped in Excel from an XLS file I got from someone (having nothing to do with the task at hand). It has seven rows of actual data, then 147 rows of commas. The comma-only rows have 13, 15 or 255 commas, nothing else. The header line of the CSV file has 15 fields with data and is terminated by a comma (empty 16th field).
In short, I don't think it's an error for CSV files to have rows of differing lengths. We just have to return what we are given and expect the application is prepared to handle short rows. We could add more flags, but I think we should pause before we get too carried away. I've added another issue to the proto-PEP:

- How should rows of different lengths be handled? The options seem to be::

  * raise an exception when a row is encountered whose length differs from the previous row
  * silently return short rows
  * allow the caller to specify the desired row length and what to do when rows of a different length are encountered: ignore, truncate, pad, raise exception, etc.

I don't think we have to address each and every issue before a first release is made, BTW. Skip From LogiplexSoftware at earthlink.net Tue Jan 28 23:50:49 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 14:50:49 -0800 Subject: First Cut at CSV PEP In-Reply-To: <15926.1287.36487.12649@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: <1043794249.14244.3330.camel@software1.logiplex.internal> As an aside, does anyone have any objection to prepending [CSV] to the subject line of our emails on this topic? Right now Kevin's mails are going into the folder I have set aside for him and everyone else's is going into my inbox, which is making it somewhat tedious to follow.
Prepending [CSV] would allow me to set up a filter and would make my life just that much better =) -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Tue Jan 28 23:54:10 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 16:54:10 -0600 Subject: First Cut at CSV PEP In-Reply-To: <1043792044.14244.3280.camel@software1.logiplex.internal> References: <1043792044.14244.3280.camel@software1.logiplex.internal> Message-ID: <15927.2578.699647.710265@montanaro.dyndns.org> Cliff> It's also debatable whether the numbers should have been returned Cliff> as strings or numbers. I lean towards the former, as csv is a Cliff> text format and can't convey this sort of information by itself, Cliff> which is why I chose to return only strings, including the empty Cliff> string for an empty field rather than None. I agree with Kevin Cliff> that this is best left to application logic rather than the Cliff> module. I think returning strings is more Pythonic (explicit is better than implicit), while returning numbers is more Perlish. There's no particular reason the user couldn't specify a set of type converters to filter the input rows, e.g.:

    [int, int, str, mxDateTime.DateTimeFromString, ...]

but she could do that just as easily herself:

    reader = csv.reader(open("some.csv"))
    for row in reader:
        for i in range(min(len(rowtypes), len(row))):
            row[i] = rowtypes[i](row[i])

or something similar. Here again we get into the sticky issue of row length, suggesting we should just pass the buck to the caller.
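Skip's converter loop above can be made runnable with the csv module API that eventually shipped; the four-column schema and the sample data here are invented for illustration, and mxDateTime is swapped for plain float to stay within the standard library:

```python
import csv
import io

# Hypothetical per-column converters; every field comes off the wire as a
# string, so the caller applies whatever types it knows about.
rowtypes = [int, int, str, float]

data = io.StringIO("1,2,Bob,3.5\n10,20,Ann,0.25\n")
rows = []
for row in csv.reader(data):
    # min() guards against short rows -- the "pass the buck" policy:
    # trailing columns with no converter are simply left as strings.
    for i in range(min(len(rowtypes), len(row))):
        row[i] = rowtypes[i](row[i])
    rows.append(row)

print(rows)  # [[1, 2, 'Bob', 3.5], [10, 20, 'Ann', 0.25]]
```

Note that the row-length question stays with the caller: a short row just gets fewer conversions applied.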
Skip From djc at object-craft.com.au Tue Jan 28 23:59:29 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 09:59:29 +1100 Subject: First Cut at CSV PEP In-Reply-To: References: Message-ID: >>>>> "Kevin" == Kevin Altis writes: Kevin> The big issue with the MS/Excel CSV format is that MS doesn't Kevin> appear to escape any characters or support import of escaped Kevin> characters. A field that contains characters that you might Kevin> normally escape (including a comma if that is the separator) Kevin> is instead enclosed in double quotes by default and then any Kevin> double quotes in the field are doubled. I thought that we were trying to build a CSV parser which would deal with different dialects, not just what Excel does. Am I wrong making that assumption? If we were to only target Excel our task would be much easier. I think that we should be trying to come up with an engine wrapped by a friendly API which can be made more powerful over time in order to parse more and more dialects. Kevin> I found this MySQL article where the dialogs show the emphasis Kevin> on escape characters. Kevin> http://www.databasejournal.com/features/mysql/article.php/10897_1558731_5 Kevin> It doesn't seem like you would run into a case where a file Kevin> would use the MS CSV format and have escaped characters too, Kevin> but perhaps these exist in the wild. There are CSV formats which do not use quote characters; instead they escape the delimiters. - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 00:08:17 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 10:08:17 +1100 Subject: First Cut at CSV PEP In-Reply-To: <1043790321.25139.3251.camel@software1.logiplex.internal> References: <1043790321.25139.3251.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> But then MS isn't the only potential target, just our initial Cliff> (and primary) target.
foobar87 may allow export of escaped Cliff> newlines and put an extraneous space after every delimiter and Cliff> we don't want someone to have to write another csv importer to Cliff> deal with it. I agree. Excel compatibility is very important, but it is not the only format we should be supporting. >> The universal readlines support in Python 2.3 may impact the use of >> a file reader/writer when processing different text files, but >> would returns or newlines within a field be impacted? Should the >> PEP and API specify that the record delimiter can be either CR, LF, >> or CR/LF, but use of those characters inside a field requires the >> field to be quoted or an exception will be thrown? Interesting point. I think that newlines inside records are going to be the same as those separating records. Anything else would be very bizarre. Cliff> The idea of raising an exception brings up an interesting Cliff> problem that I had to deal with in DSV. I've run across files Cliff> that were missing fields and just had a callback so the Cliff> programmer could decide how to deal with it. This can be the Cliff> result of corrupted data, but it's also possible for an Cliff> application to only export fields that actually contain data, Cliff> for instance:

Cliff>     1,2,3,4,5
Cliff>     1,2,3
Cliff>     1,2,3,4

I think that this is something which should be a layer above the CSV parser. The technique for reading a CSV (from the PEP) looks like this:

    csvreader = csv.parser(file("some.csv"))
    for row in csvreader:
        process(row)

Then any constraints on the content and structure of the records sit logically in the process() function. Cliff> This could very well be a valid csv file. I'm not aware of any Cliff> requirement that rows all be the same length. We'll need to Cliff> have some fairly flexible error-handling to allow for this type Cliff> of thing when required or raise an exception when it indicates Cliff> corrupt/invalid data.
In DSV I allowed custom error-handlers Cliff> so the programmer could indicate whether to process the line as Cliff> normal, discard it, etc. I am convinced that this does not belong in the parser. We can always keep going up in layers and build a csvutils module on top of the parser. - Dave -- http://www.object-craft.com.au From skip at pobox.com Wed Jan 29 00:09:58 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 17:09:58 -0600 Subject: First Cut at CSV PEP In-Reply-To: <1043792488.25146.3288.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> <1043792488.25146.3288.camel@software1.logiplex.internal> Message-ID: <15927.3526.657543.26339@montanaro.dyndns.org> Cliff> Don't save as CSV, save as TSV, which is the same, but with tabs Cliff> rather than commas. I don't know that it allows specifying the Cliff> quote character. Looking at the choices more closely, I see Excel has multiple tabular save formats. I just saved a simple sheet in each of the formats and scp'd it to my laptop. I'll check 'em out later. Cliff> IIRC, you can embed a newline in a cell by entering " in a cell Cliff> to mark it as a string value, then I think you can then just hit Cliff> enter (or perhaps ctrl+enter). That didn't work, but I eventually figured out that ALT+ENTER allows you to enter a "hard carriage return". Skip From LogiplexSoftware at earthlink.net Wed Jan 29 00:11:16 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 15:11:16 -0800 Subject: [CSV] Number of lines in CSV files In-Reply-To: <15925.58225.712028.494438@montanaro.dyndns.org> References: <15925.58225.712028.494438@montanaro.dyndns.org> Message-ID: <1043795476.25146.3351.camel@software1.logiplex.internal> Another thing that just occurred to me is that Excel has historically been limited in the number of rows and columns that it can import. 
This number has increased with recent versions (I think it was 32K lines in Excel 97, Kevin informs me it's 64K in Excel 2000). Since export will be a feature of the CSV module, should we have some sort of warning or raise an exception when exporting data larger than the target application can handle, or should we just punt on this? -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From djc at object-craft.com.au Wed Jan 29 00:11:19 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 10:11:19 +1100 Subject: First Cut at CSV PEP In-Reply-To: <15926.65021.926324.438352@montanaro.dyndns.org> References: <15926.65021.926324.438352@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Kevin> I'm not real comfortable with the dialect idea, it doesn't seem Kevin> to add any value over simply specifying a separator and Kevin> delimiter. Skip> I look at it as a simple way to specify a group of Skip> characteristics specific to the way a vendor reads and writes Skip> CSV files. It frees the programmer from having to know all the Skip> characteristics of their chosen vendor's file format. Think of Skip> it as the difference between Larry Wall's configure script for Skip> Perl and the GNU configure script. When I configure Perl I have Skip> to know enough about my system to know the alignment boundary of Skip> malloc, whether the system is big- or little-endian, etc, even Skip> though I know damn well it can figure that stuff out reliably. Skip> GNU configure almost never prompts you. It reliably figures out Skip> all the low-level stuff for you. Yes, I agree. Users of the module will probably want to be able to handle files from specific applications without necessarily wanting to go through the pain of learning the hard way about exactly how dialects differ. It is as Skip says, just like the autoconf stuff.
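Backing up to Cliff's export-size question: one hedged option is a wrapper a layer above the writer rather than a module feature. The class name and warn-once policy below are invented for illustration; 65536 rows is the 64K figure cited for Excel 2000:

```python
import csv
import io
import warnings

class LimitWarningWriter:
    """Hypothetical wrapper: counts rows and warns (rather than raising)
    once output grows past what the target application can load."""
    def __init__(self, writer, max_rows=65536):  # 64K rows, per Excel 2000
        self.writer = writer
        self.max_rows = max_rows
        self.count = 0

    def writerow(self, row):
        self.count += 1
        if self.count == self.max_rows + 1:  # warn once, on the first excess row
            warnings.warn("row count exceeds target application's limit")
        self.writer.writerow(row)

# Demonstrate with a tiny limit so the warning fires quickly.
buf = io.StringIO()
w = LimitWarningWriter(csv.writer(buf), max_rows=2)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for i in range(3):
        w.writerow([i])
print(len(caught))  # warned exactly once, when row 3 was written
```

A warning rather than an exception keeps the punt option open: the data is still written in full, and the caller decides whether the limit matters.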
- Dave -- http://www.object-craft.com.au From skip at pobox.com Wed Jan 29 00:12:52 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 17:12:52 -0600 Subject: First Cut at CSV PEP In-Reply-To: <1043794249.14244.3330.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043794249.14244.3330.camel@software1.logiplex.internal> Message-ID: <15927.3700.803751.757376@montanaro.dyndns.org> Cliff> As an aside, does anyone have any objection to prepending [CSV] Cliff> to the subject line of our emails on this topic? Nope. I could set up a Mailman list on the Mojam server if you don't think that's too much overkill. Skip From skip at pobox.com Wed Jan 29 00:21:06 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 17:21:06 -0600 Subject: Checkin privileges Message-ID: <15927.4194.943233.439762@montanaro.dyndns.org> I sent a second note to Guido about checkin privilege to the Python repository. All except Kevin (who said anon cvs was good enough for his needs) should get access soon enough. Don't forget, use caution if you decide you need to make changes outside the csv sandbox. (I doubt any of you need reminding but figured I ought to be anal about it.) Also, if you're not already subscribed, I urge you to subscribe to python-dev. The signup page is on the Python website. It will let you know generally what's going on with the Python developer community. You'll know when releases are impending, etc. Skip From andrewm at object-craft.com.au Wed Jan 29 00:28:03 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 10:28:03 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Dave Cole of "29 Jan 2003 10:08:17 +1100." 
References: <1043790321.25139.3251.camel@software1.logiplex.internal> Message-ID: <20030128232803.C6A943C1F4@coffee.object-craft.com.au> >>> The universal readlines support in Python 2.3 may impact the use of >>> a file reader/writer when processing different text files, but >>> would returns or newlines within a field be impacted? Should the >>> PEP and API specify that the record delimiter can be either CR, LF, >>> or CR/LF, but use of those characters inside a field requires the >>> field to be quoted or an exception will be thrown? > >Interesting point. I think that newlines inside records are going to >be the same as those separating records. Anything else would be very >bizarre. You should know better than to make a statement like that where Microsoft is concerned. Excel uses a single LF within fields, but CRLF at the end of lines. If you import a field containing CRLF, the CR appears within the field as a box (the "unprintable character" symbol). -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Wed Jan 29 00:28:49 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 10:28:49 +1100 Subject: First Cut at CSV PEP In-Reply-To: References: Message-ID: >>>>> "Kevin" == Kevin Altis writes: >> From: Dave Cole >> >> >>>>> "Skip" == Skip Montanaro writes: >> >> I only have one issue with the PEP as it stands. It is still >> aiming too low. One of the things that we support in our parser is >> the ability to handle CSV without quote characters. >> >> field1,field2,field3\, field3,field4 Kevin> Excel certainly can't handle that, nor do I think Access Kevin> can. If a field contains a comma, then the field must be Kevin> quoted. Now, that isn't to say that we shouldn't be able to Kevin> support the idea of escaped characters, but when exporting if Kevin> you do want something that a tool like Excel could read, you Kevin> would need to generate an exception if quoting wasn't Kevin> specified. 
The same would probably apply for embedded newlines Kevin> in a field without quoting. Kevin> Being able to generate exceptions on import and export Kevin> operations could be one of the big benefits of this module. You Kevin> won't accidentally export something that someone on the other Kevin> end won't be able to use and you'll know on import that you Kevin> have garbage before you try and use it. For example, when I Kevin> first started trying to import Access data that was Kevin> tab-separated, I didn't realize there were embedded newlines Kevin> until much later, at which point I was able to go back and Kevin> export as CSV with quote delimiters and the data became Kevin> usable. I suppose that exporting should raise an exception if you specify any variation on the dialect in the writer function.

    csvwriter = csv.writer(file("newnastiness.csv", "w"),
                           dialect='excel2000', delimiter='"')

That should raise an exception. This probably shouldn't raise an exception though:

    csvwriter = csv.writer(file("newnastiness.csv", "w"),
                           dialect='excel2000')
    csvwriter.setparams(delimiter='"')

>> I think that we need some way to handle a potentially different set >> of options on each dialect. Kevin> I'm not real comfortable with the dialect idea, it doesn't seem Kevin> to add any value over simply specifying a separator and Kevin> delimiter. It makes things *a lot* easier for module users who are not fully conversant in the vagaries of CSV. Kevin> We aren't dealing with encodings, so anything other than 7-bit Kevin> ASCII unless specified as a delimiter or separator would be Kevin> undefined, yes? The only thing that really matters is the Kevin> delimiter and separator and then how quoting is handled of Kevin> either of those characters and embedded returns and newlines Kevin> within a field. Correct me if I'm wrong, but I don't think the Kevin> MS CSV formats can deal with embedded CR or LF unless fields Kevin> are quoted and that will be done with a " character.
We are not just trying to deal with MS CSV formats though. Kevin> Note If your workbook contains special font characters such as Kevin> a copyright symbol (C), and you will be using the converted Kevin> text file on a computer with a different operating system, save Kevin> the workbook in the text file format appropriate for that Kevin> system. For example, if you are using Windows and want to use Kevin> the text file on a Macintosh computer, save the file in the CSV Kevin> (Macintosh) format. If you are using a Macintosh computer and Kevin> want to use the text file on a system running Windows or Kevin> Windows NT, save the file in the CSV (Windows) format." Kevin> The CR, CR/LF, and LF line endings probably have something to Kevin> do with saving in Mac format, but it may also do some 8-bit Kevin> character translation. Should we be trying to handle Unicode? I think we should, since Python is now Unicode capable. Kevin> The universal readlines support in Python 2.3 may impact the Kevin> use of a file reader/writer when processing different text Kevin> files, but would returns or newlines within a field be Kevin> impacted? Should the PEP and API specify that the record Kevin> delimiter can be either CR, LF, or CR/LF, but use of those Kevin> characters inside a field requires the field to be quoted or an Kevin> exception will be thrown? Should we raise an exception or just pass the data through? If it is not a newline, then it is not a newline. - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Wed Jan 29 00:39:47 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 10:39:47 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Dave Cole of "29 Jan 2003 10:28:49 +1100." References: Message-ID: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> >I suppose that exporting should raise an exception if you specify any >variation on the dialect in the writer function.
> > csvwriter = csv.writer(file("newnastiness.csv", "w"), > dialect='excel2000', delimiter='"') > >That should raise an exception. You mean "raise an exception because the result would be ambiguous", or "raise an exception because it's not excel2000"? BTW, I don't have access to Excel 2000, only 97. I'm going to assume they're the same until proven otherwise (bad assumption, I know). -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Wed Jan 29 00:43:33 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 10:43:33 +1100 Subject: First Cut at CSV PEP In-Reply-To: <1043788652.25139.3222.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> On Mon, 2003-01-27 at 20:56, Dave Cole wrote: >> I only have one issue with the PEP as it stands. It is still >> aiming too low. One of the things that we support in our parser is >> the ability to handle CSV without quote characters. >> >> field1,field2,field3\, field3,field4 >> >> One of our customers has data like the above. To handle this we >> would need something like the following: >> >> # Use the 'raw' dialect to get access to all tweakables. >> writer(fileobj, dialect='raw', quotechar=None, delimiter=',', >> escapechar='\\') Cliff> +1 on escapechar, -1 on 'raw' dialect. See below. Cliff> Why would a 'raw' dialect be needed? It isn't clear to me why Cliff> escapechar would be mutually exclusive with any particular Cliff> dialect. Further, not specifying a dialect (dialect=None) Cliff> should be the default which would seem the same as 'raw'. >> I think that we need some way to handle a potentially different set >> of options on each dialect. 
Cliff> I'm not understanding how this is different from Skip's Cliff> suggestion to use Cliff> reader(fileobj, dialect="excel2000", delimiter='\t') Cliff> Or are you suggesting that not all options would be available Cliff> on all dialects? Can you suggest an example? I think it is important to keep in mind the users of the module who are not expert in the various dialects of CSV. If presented with a flat list of all options supported they are going to engage in a fair amount of head scratching. If we try to make things easier for users by mirroring the options that their application presents then they are going to have a much easier time working out how to use the module for their specific problem. By limiting the available options based upon the dialect specified by the user we will be doing them a favour. The point of the 'raw' dialect is to expose the full capabilities of the raw parser. Maybe we should use None rather than 'raw'. >> When you CSV export from Excel, do you have the ability to use a >> delimiter other than comma? Do you have the ability to change the >> quotechar? Cliff> I think it is an option to save as a TSV file (IIRC), which is Cliff> the same as a CSV file, but with tabs. Hmm... What would be the best way to handle Excel TSV? Maybe a new dialect 'excel-tsv'? >> Should the wrapper protect you from yourself so that when you >> select the Excel dialect you are limited to the options available >> within Excel? Cliff> No. I think this would be unnecessarily limiting. I am not saying that the wrapper should absolutely prevent someone from using options not available in the application. If you want to break the dialect then maybe it should be a two step process.
    csvwriter = csv.writer(file("newnastiness.csv", "w"),
                           dialect='excel2000')
    csvwriter.setparams(delimiter='"')

- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 00:59:44 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 10:59:44 +1100 Subject: First Cut at CSV PEP In-Reply-To: <15926.64576.481489.373053@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Kevin> Probably need to specify that input and output deals with Kevin> string representations, but there are some differences: Kevin> [[5,'Bob',None,1.0]] Kevin> DSV.exportCSV produces Kevin> '5,Bob,None,1.0' Skip> I'm not so sure this mapping None to "None" on output is such a Skip> good idea because it's not reversible in all situations and Skip> hurts portability to other systems (e.g., does Excel have a Skip> concept of None? what happens if you have a text field which Skip> just happens to contain "None"?). I think that None should always be written as a zero-length field, and always read as the field value 'None'. Skip> I think we need to limit the data which can be output to Skip> strings, Unicode strings (if we use an encoded stream), floats Skip> and ints. Anything else should raise TypeError. Is there any merit in having the writer handle non-string data by producing an empty field for None, and the result of PyObject_Str() for all other values? Skip> I like my CSV files to be fully quoted (even fields which may Skip> contain numbers), largely because it makes later (dangerous) Skip> matching using regular expressions simpler. Otherwise I wind up Skip> having to make all the quotes in the regular expressions Skip> optional. It just complicates things. That raises another implementation issue. If you export from Excel, does it always quote fields? If not then the default dialect behaviour should not unconditionally quote fields.
We could/should support mandatoryquote as a writer option. I am going to spend some time tonight seeing if I can fold all of my ideas into the PEP so you can all poke holes in it. - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 01:02:15 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:02:15 +1100 Subject: First Cut at CSV PEP In-Reply-To: <1043792044.14244.3280.camel@software1.logiplex.internal> References: <1043792044.14244.3280.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> It's also debatable whether the numbers should have been Cliff> returned as strings or numbers. I lean towards the former, as Cliff> csv is a text format and can't convey this sort of information Cliff> by itself, which is why I chose to return only strings, Cliff> including the empty string for an empty field rather than None. Cliff> I agree with Kevin that this is best left to application logic Cliff> rather than the module. Yes. - Dave -- http://www.object-craft.com.au From LogiplexSoftware at earthlink.net Wed Jan 29 01:03:45 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 16:03:45 -0800 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: References: Message-ID: <1043798625.25139.3395.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 15:28, Dave Cole wrote: > I suppose that exporting should raise an exception if you specify any > variation on the dialect in the writer function. > > csvwriter = csv.writer(file("newnastiness.csv", "w"), > dialect='excel2000', delimiter='"') > > That should raise an exception. I still don't see a good reason for this. The programmer asked for it, let her do it. I don't see a problem with letting the programmer shoot herself in the foot, as long as the gun doesn't start out pointing at it. 
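For what it's worth, both quoting behaviours being debated ended up as writer options in the csv module that eventually shipped, and that writer also adopts the None-as-empty-field idea; a sketch with that later API, not the one being designed in this thread:

```python
import csv
import io

row = [5, 'Bob', None, 1.0]

# Default (QUOTE_MINIMAL) matches Excel's quote-only-when-needed export.
minimal = io.StringIO()
csv.writer(minimal).writerow(row)

# QUOTE_ALL is the "mandatory quote" style Skip prefers: every field,
# numbers included, comes out quoted.
full = io.StringIO()
csv.writer(full, quoting=csv.QUOTE_ALL).writerow(row)

print(repr(minimal.getvalue()))  # '5,Bob,,1.0\r\n' -- None becomes an empty field
print(repr(full.getvalue()))     # '"5","Bob","","1.0"\r\n'
```

Making the style a per-writer option, as suggested above, is exactly how the programmer gets to choose between the two.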
> This probably shouldn't raise an exception though: > > csvwriter = csv.writer(file("newnastiness.csv", "w"), > dialect='excel2000') > csvwriter.setparams(delimiter='"') While this provides a workaround, it also seems a bit non-obvious why this should work when passing delimiter as an argument raises an exception. I'm not dead-set against it, it's JMHO. > >> I think that we need some way to handle a potentially different set > >> of options on each dialect. > > Kevin> I'm not real comfortable with the dialect idea, it doesn't seem > Kevin> to add any value over simply specifying a separator and > Kevin> delimiter. > > It makes things *a lot* easier for module users who are not fully > conversant in the vagaries of CSV. I agree. > Kevin> The CR, CR/LF, and LF line endings probably have something to > Kevin> do with saving in Mac format, but it may also do some 8-bit > Kevin> character translation. > > Should we be trying to handle Unicode? I think we should, since Python > is now Unicode capable. What issues is Unicode support going to raise? > Kevin> The universal readlines support in Python 2.3 may impact the > Kevin> use of a file reader/writer when processing different text > Kevin> files, but would returns or newlines within a field be > Kevin> impacted? Should the PEP and API specify that the record > Kevin> delimiter can be either CR, LF, or CR/LF, but use of those > Kevin> characters inside a field requires the field to be quoted or an > Kevin> exception will be thrown? > > Should we raise an exception or just pass the data through? > > If it is not a newline, then it is not a newline. This seems like a particularly intractable problem. If a file can't decide what sort of newlines it is going to use, then I'm not convinced it's the parser's problem. So the question becomes whether to raise an exception or pass the data through.
The two things to consider in this case are:

1) The data might be correct, in which case it should be passed through

2) The target for the data might be someone's mission-critical SQL server and we don't want to help them mung up their data. An exception would seem appropriate.

Frankly, I think I lean towards an exception on this one. There are enough text-processing tools available (dos2unix and kin) that someone should be able to pre-process a CSV file that is raising exceptions and get it into a form acceptable to the parser. A little work up front is far more acceptable than putting out a fire on someone's database. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From djc at object-craft.com.au Wed Jan 29 01:03:52 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:03:52 +1100 Subject: First Cut at CSV PEP In-Reply-To: <15927.3700.803751.757376@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043794249.14244.3330.camel@software1.logiplex.internal> <15927.3700.803751.757376@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Cliff> As an aside, does anyone have any objection to prepending [CSV] Cliff> to the subject line of our emails on this topic? Skip> Nope. I could set up a Mailman list on the Mojam server if you Skip> don't think that's too much overkill. Do it. We can then use URLs to old messages.
- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 01:04:32 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:04:32 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030128232803.C6A943C1F4@coffee.object-craft.com.au> References: <1043790321.25139.3251.camel@software1.logiplex.internal> <20030128232803.C6A943C1F4@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >>>> The universal readlines support in Python 2.3 may impact the use >>>> of a file reader/writer when processing different text files, but >>>> would returns or newlines within a field be impacted? Should the >>>> PEP and API specify that the record delimiter can be either CR, >>>> LF, or CR/LF, but use of those characters inside a field requires >>>> the field to be quoted or an exception will be thrown? >> Interesting point. I think that newlines inside records are going >> to be the same as those separating records. Anything else would be >> very bizarre. Andrew> You should know better than to make a statement like that Andrew> where Microsoft is concerned. Excel uses a single LF within Andrew> fields, but CRLF at the end of lines. If you import a field Andrew> containing CRLF, the CR appears within the field as a box (the Andrew> "unprintable character" symbol). Touche :-) - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 01:07:18 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:07:18 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> References: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >> I suppose that exporting should raise an exception if you specify >> any variation on the dialect in the writer function. 
>> >> csvwriter = csv.writer(file("newnastiness.csv", "w"), >> dialect='excel2000', delimiter='"') >> >> That should raise an exception. Andrew> You mean "raise an exception because the result would be Andrew> ambiguous", or "raise an exception because it's not Andrew> excel2000"? Because it is not 'excel2000'. Andrew> BTW, I don't have access to Excel 2000, only 97. I'm going to Andrew> assume they're the same until proven otherwise (bad Andrew> assumption, I know). This is a prime example of why we should support dialects. - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 01:08:16 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:08:16 +1100 Subject: [CSV] Number of lines in CSV files In-Reply-To: <1043795476.25146.3351.camel@software1.logiplex.internal> References: <15925.58225.712028.494438@montanaro.dyndns.org> <1043795476.25146.3351.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> Another thing that just occurred to me is that Excel has Cliff> historically been limited in the number of rows and columns Cliff> that it can import. This number has increased with recent Cliff> versions (I think it was 32K lines in Excel 97, Kevin informs Cliff> me it's 64K in Excel 2000). Cliff> Since export will be a feature of the CSV module, should we Cliff> have some sort of warning or raise an exception when exporting Cliff> data larger than the target application can handle, or should Cliff> we just punt on this? Arrrgggg. My brain just dribbled out of my ears... - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Wed Jan 29 01:08:16 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 11:08:16 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Dave Cole of "29 Jan 2003 10:43:33 +1100." 
References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> Message-ID: <20030129000816.2C9153C1F4@coffee.object-craft.com.au> >I think it is important to keep in mind the users of the module who >are not expert in the various dialects of CSV. If presented with a >flat list of all options supported they are going to engage in a fair >amount of head scratching. > >If we try to make things easier for users by mirroring the options >that their application presents then they are going to have a much >easier time working out how to use the module for their specific >problem. By limiting the available options based upon the dialect >specified by the user we will be doing them a favour. > >The point of the 'raw' dialect is to expose the full capabilities of >the raw parser. Maybe we should use None rather than 'raw'. My feeling is that this simply changes the shape of the complexity without really helping. I think we should just stick with the "a dialect is a set of defaults" idea. >Hmm... What would be the best way to handle Excel TSV. Maybe a new >dialect 'excel-tsv'? When saving, Excel97 calls this "Text (Tab delimited)", so maybe "excel-tab" would be clear enough. CSV is "CSV (Comma delimited)". On import, it seems to just guess what the file is - I couldn't see a way under Excel97 to specify. >I am not saying that the wrapper should absolutely prevent someone >from using options not available in the application. If you want to >break the dialect then maybe it should be a two step process. > > csvwriter = csv.writer(file("newnastiness.csv", "w"), > dialect='excel2000') > csvwriter.setparams(delimiter='"') This strikes me as B&D. 
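Andrew's "a dialect is a set of defaults" idea could be sketched roughly as follows. This is a hypothetical illustration only: the `DIALECTS` table, `make_params`, and the option names are made-up stand-ins, not the proposed API.

```python
# Hypothetical sketch of "a dialect is a set of defaults": each dialect
# names a bundle of default options, and per-call keyword arguments
# simply override them. All names here are illustrative.
DIALECTS = {
    "excel": {"delimiter": ",", "quotechar": '"', "lineterminator": "\r\n"},
    "excel-tab": {"delimiter": "\t", "quotechar": '"', "lineterminator": "\r\n"},
}

def make_params(dialect="excel", **overrides):
    """Merge caller overrides on top of the dialect's defaults."""
    params = dict(DIALECTS[dialect])
    params.update(overrides)
    return params
```

Under this reading, passing dialect='excel-tab' together with delimiter='|' would simply mean "excel-tab defaults, but with a pipe delimiter", rather than being an error.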
-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Wed Jan 29 01:15:44 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 11:15:44 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Dave Cole of "29 Jan 2003 11:07:18 +1100." References: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> Message-ID: <20030129001544.0FD133C1F4@coffee.object-craft.com.au> >>> That should raise an exception. > >Andrew> You mean "raise an exception because the result would be >Andrew> ambiguous", or "raise an exception because it's not >Andrew> excel2000"? 
> >Because it is not 'excel2000'. I don't like it, as I mentioned in my previous e-mail. Excel (97, at least) doesn't let you tweak and tune, so *any* non-default settings are "not excel". A better idea would be to have the dialect turn on "strict_blah" if it's thought necessary. But we still need to raise exceptions on nonsense formats (like using quote as a field separator while also using it as the quote character). >Andrew> BTW, I don't have access to Excel 2000, only 97. I'm going to >Andrew> assume they're the same until proven otherwise (bad >Andrew> assumption, I know). > >This is a prime example of why we should support dialects. And every dialect should be supported by a wad of tests... 8-) -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From LogiplexSoftware at earthlink.net Wed Jan 29 01:15:49 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 16:15:49 -0800 Subject: [CSV] Number of lines in CSV files In-Reply-To: References: <15925.58225.712028.494438@montanaro.dyndns.org> <1043795476.25146.3351.camel@software1.logiplex.internal> Message-ID: <1043799349.25146.3400.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 16:08, Dave Cole wrote: > >>>>> "Cliff" == Cliff Wells writes: > > Cliff> Another thing that just occurred to me is that Excel has > Cliff> historically been limited in the number of rows and columns > Cliff> that it can import. This number has increased with recent > Cliff> versions (I think it was 32K lines in Excel 97, Kevin informs > Cliff> me it's 64K in Excel 2000). > > Cliff> Since export will be a feature of the CSV module, should we > Cliff> have some sort of warning or raise an exception when exporting > Cliff> data larger than the target application can handle, or should > Cliff> we just punt on this? > > Arrrgggg. My brain just dribbled out of my ears... So, +1 on punt? Actually I think this particular aspect would be fairly simple to handle. 
Another attribute of a dialect could be sizelimits = (maxrows, maxcols) and set to (None, None) if the programmer doesn't care or just wants to bypass that check. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From djc at object-craft.com.au Wed Jan 29 01:15:56 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:15:56 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030129000816.2C9153C1F4@coffee.object-craft.com.au> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >> I think it is important to keep in mind the users of the module who >> are not expert in the various dialects of CSV. If presented with a >> flat list of all options supported they are going to engage in a >> fair amount of head scratching. >> >> If we try to make things easier for users by mirroring the options >> that their application presents then they are going to have a much >> easier time working out how to use the module for their specific >> problem. By limiting the available options based upon the dialect >> specified by the user we will be doing them a favour. >> >> The point of the 'raw' dialect is to expose the full capabilities >> of the raw parser. Maybe we should use None rather than 'raw'. Andrew> My feeling is that this simply changes the shape of the Andrew> complexity without really helping. Andrew> I think we should just stick with the "a dialect is a set of Andrew> defaults" idea. Fair enough. Instead of limiting the tweakable options by raising an exception we could have an interface which allowed the user to query the options normally associated with a dialect. >> Hmm... What would be the best way to handle Excel TSV. Maybe a >> new dialect 'excel-tsv'? 
Andrew> When saving, Excel97 calls this "Text (Tab delimited)", so Andrew> maybe "excel-tab" would be clear enough. CSV is "CSV (Comma Andrew> delimited)". Yup. Andrew> On import, it seems to just guess what the file is - I Andrew> couldn't see a way under Excel97 to specify. Some kind of sniffing going on. Should we have a sniffer in the module? >> I am not saying that the wrapper should absolutely prevent someone >> from using options not available in the application. If you want >> to break the dialect then maybe it should be a two step process. >> >> csvwriter = csv.writer(file("newnastiness.csv", "w"), dialect='excel2000') >> csvwriter.setparams(delimiter='"') Andrew> This strikes me as B&D. Just what are you trying to imply by that? :-) - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 01:24:17 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:24:17 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <1043798625.25139.3395.camel@software1.logiplex.internal> References: <1043798625.25139.3395.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> On Tue, 2003-01-28 at 15:28, Dave Cole wrote: >> I suppose that exporting should raise an exception if you specify >> any variation on the dialect in the writer function. >> >> csvwriter = csv.writer(file("newnastiness.csv", "w"), dialect='excel2000', delimiter='"') >> >> That should raise an exception. Cliff> I still don't see a good reason for this. The programmer asked Cliff> for it, let her do it. I don't see a problem with letting the Cliff> programmer shoot herself in the foot, as long as the gun Cliff> doesn't start out pointing at it. 
>> This probably shouldn't raise an exception though: >> >> csvwriter = csv.writer(file("newnastiness.csv", "w"), dialect='excel2000') >> csvwriter.setparams(delimiter='"') Cliff> While this provides a workaround, it also seems a bit Cliff> non-obvious why this should work when passing delimiter as an Cliff> argument raises an exception. I'm not dead-set against it, it's Cliff> JMHO. I think you are right - it is a bad idea in retrospect. Kevin> The CR, CR/LF, and LF line endings probably have something to Kevin> do with saving in Mac format, but it may also do some 8-bit Kevin> character translation. >> Should we be trying to handle unicode? I think we should since >> Python is now unicode capable. Cliff> What issues is unicode support going to raise? The low level parser (C code) is probably going to need to handle unicode. >> If it is not a newline, then it is not a newline. Cliff> This seems like a particularly intractable problem. If a file Cliff> can't decide what sort of newlines it is going to use, then I'm Cliff> not convinced it's the parser's problem. Cliff> So the question becomes whether to except or pass through. The Cliff> two things to consider in this case are: Cliff> 1) The data might be correct, in which case it should be passed Cliff> through 2) The target for the data might be someone's Cliff> mission-critical SQL server and we don't want to help them mung Cliff> up their data. An exception would seem appropriate. Cliff> Frankly, I think I lean towards an exception on this one. Cliff> There are enough text-processing tools available (dos2unix and Cliff> kin) that someone should be able to pre-process a CSV file that Cliff> is raising exceptions and get it into a form acceptable to the Cliff> parser. A little work up front is far more acceptable than Cliff> putting out a fire on someone's database. Should the reader have an option which turns on universal newline mode? 
This would allow for both behaviours - if a non-conforming newline is encountered while not in universal newline mode then an exception would be raised. According to Andrew's previous message the default setting for Excel97 would be universal newline mode turned on. - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Wed Jan 29 01:25:46 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 11:25:46 +1100 Subject: [CSV] Number of lines in CSV files In-Reply-To: Message from Cliff Wells of "28 Jan 2003 16:15:49 -0800." <1043799349.25146.3400.camel@software1.logiplex.internal> References: <15925.58225.712028.494438@montanaro.dyndns.org> <1043795476.25146.3351.camel@software1.logiplex.internal> <1043799349.25146.3400.camel@software1.logiplex.internal> Message-ID: <20030129002546.493EB3C1F4@coffee.object-craft.com.au> >So, +1 on punt? +1 on punt from me. >Actually I think this particular aspect would be fairly simple to >handle. Another attribute of a dialect could be sizelimits = (maxrows, >maxcols) and set to (None, None) if the programmer doesn't care or just >wants to bypass that check. Kitchen sink - we'll end up making the dialects too specific for the user to be able to choose ("do I have Excel2000 with SP2 applied, or..."). I bet it even varies by region of the world (for example, the Chinese edition probably has different limits). I have a sneaking suspicion that Excel's CSV parsing code is reasonably stable - they're probably not game to make changes now that it mostly works. We might find that dialect="excel" is good enough. 
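The universal-newline reader option Dave floats above might behave like this sketch (illustrative names only, not the PEP's API): in universal mode any of CR, LF or CRLF ends a record, while in strict mode a newline that doesn't match the configured terminator raises an exception.

```python
import re

class CSVError(Exception):
    """Raised on a non-conforming line terminator in strict mode."""

def split_records(data, universal_newlines=True, lineterminator="\r\n"):
    # Universal mode: accept CR, LF or CRLF as a record terminator.
    if universal_newlines:
        return re.split(r"\r\n|\r|\n", data)
    # Strict mode: any stray CR/LF outside the expected terminator is an error.
    stripped = data.replace(lineterminator, "\x00")
    if "\r" in stripped or "\n" in stripped:
        raise CSVError("non-conforming line terminator in input")
    return stripped.split("\x00")
```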
-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Wed Jan 29 01:28:45 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:28:45 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030129001544.0FD133C1F4@coffee.object-craft.com.au> References: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> <20030129001544.0FD133C1F4@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >>>> That should raise an exception. >> Andrew> You mean "raise an exception because the result would be Andrew> ambiguous", or "raise an exception because it's not Andrew> excel2000"? >> Because it is not 'excel2000'. Andrew> I don't like it, as I mentioned in my previous e-mail. Excel Andrew> (97, at least) doesn't let you tweak and tune, so *any* Andrew> non-default settings are "not excel". Andrew> A better idea would be to have the dialect turn on Andrew> "strict_blah" if it's thought necessary. Probably not. I now think that my original idea was a bad one. Andrew> But we still need to raise exceptions on nonsense formats Andrew> (like using quote as a field separator while also using it as Andrew> the quote character). Yup. Andrew> BTW, I don't have access to Excel 2000, only 97. I'm going to Andrew> assume they're the same until proven otherwise (bad Andrew> assumption, I know). >> This is a prime example of why we should support dialects. Andrew> And every dialect should be supported by a wad of tests... 8-) We need to have a torture test suite (which is manually run against an application) with which to expose the options which apply to a dialect. The results of the torture test then are set in stone as a regression test for that dialect. 
- Dave -- http://www.object-craft.com.au From LogiplexSoftware at earthlink.net Wed Jan 29 01:28:46 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 16:28:46 -0800 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> Message-ID: <1043800126.25139.3411.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 16:15, Dave Cole wrote: > >>>>> "Andrew" == Andrew McNamara writes: > > >> I think it is important to keep in mind the users of the module who > >> are not expert in the various dialects of CSV. If presented with a > >> flat list of all options supported they are going to engage in a > >> fair amount of head scratching. > >> > >> If we try to make things easier for users by mirroring the options > >> that their application presents then they are going to have a much > >> easier time working out how to use the module for their specific > >> problem. By limiting the available options based upon the dialect > >> specified by the user we will be doing them a favour. > >> > >> The point of the 'raw' dialect is to expose the full capabilities > >> of the raw parser. Maybe we should use None rather than 'raw'. > > Andrew> My feeling is that this simply changes the shape of the > Andrew> complexity without really helping. > > Andrew> I think we should just stick with the "a dialect is a set of > Andrew> defaults" idea. > > Fair enough. Whew. > > Instead of limiting the tweakable options by raising an exception we > could have an interface which allowed the user to query the options > normally associated with a dialect. > > >> Hmm... What would be the best way to handle Excel TSV. Maybe a > >> new dialect 'excel-tsv'? So are we leaning towards dialects being done as simple classes? 
Will 'excel-tsv' simply be defined as

class excel_tsv(excel_2000):
    delimiter = '\t'

with a dictionary for lookup:

settings = { 'excel-tsv': excel_tsv,
             'excel-2000': excel_2000, }

? > Andrew> When saving, Excel97 calls this "Text (Tab delimited)", so > Andrew> maybe "excel-tab" would be clear enough. CSV is "CSV (Comma > Andrew> delimited)". > > Yup. > > Andrew> On import, it seems to just guess what the file is - I > Andrew> couldn't see a way under Excel97 to specify. > > Some kind of sniffing going on. > > Should we have a sniffer in the module? This hasn't been brought up, but of course one of the major selling points of DSV is the "sniffing" code. However, I think I'm with Dave on having another layer (CSVutils) that would contain this sort of thing. > >> I am not saying that the wrapper should absolutely prevent someone > >> from using options not available in the application. If you want > >> to break the dialect then maybe it should be a two step process. > >> > >> csvwriter = csv.writer(file("newnastiness.csv", "w"), dialect='excel2000') > >> csvwriter.setparams(delimiter='"') > > Andrew> This strikes me as B&D. > > Just what are you trying to imply by that? :-) We should probably leave people's personal issues out of this ;) -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From altis at semi-retired.com Wed Jan 29 01:31:56 2003 From: altis at semi-retired.com (Kevin Altis) Date: Tue, 28 Jan 2003 16:31:56 -0800 Subject: [CSV] RE: Number of lines in CSV files In-Reply-To: <1043795476.25146.3351.camel@software1.logiplex.internal> Message-ID: > From: Cliff Wells > > Another thing that just occurred to me is that Excel has historically > been limited in the number of rows and columns that it can import. This > number has increased with recent versions (I think it was 32K lines in > Excel 97, Kevin informs me it's 64K in Excel 2000). 
> > Since export will be a feature of the CSV module, should we have some > sort of warning or raise an exception when exporting data larger than > the target application can handle, or should we just punt on this? +1 on punt The user may not actually be trying to import into Excel, they may be using Access, later versions of Excel might support more rows, whatever. Plus, Excel still imports the data, it just can't deal with more than 64K rows in Excel 2000. Now we could very well have some stats generated, maybe as a separate function if someone wanted to know all the gritty details of which columns of which rows contained embedded newlines, escaped characters, which rows had an odd number of columns, total number of rows, whatever. Sort of a CSV verifier if you will. ka From andrewm at object-craft.com.au Wed Jan 29 01:38:07 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 11:38:07 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Cliff Wells of "28 Jan 2003 16:28:46 -0800." <1043800126.25139.3411.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> Message-ID: <20030129003807.7185E3C1F4@coffee.object-craft.com.au> >So are we leaning towards dialects being done as simple classes? Will >'excel-tsv' simply be defined as > >class excel_tsv(excel_2000): > delimiter = '\t' > >with a dictionary for lookup: > >settings = { 'excel-tsv': excel_tsv, > 'excel-2000': excel_2000, } That seems reasonable. +1 The classes should be exposed by the module, however, so the application can subclass if need be (or just refer to the classes directly, rather than going via the str->class mapping). >This hasn't been brought up, but of course one of the major selling >points of DSV is the "sniffing" code. 
However, I think I'm with Dave on >having another layer (CSVutils) that would contain this sort of thing. Yep, +1 from me. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From LogiplexSoftware at earthlink.net Wed Jan 29 01:38:15 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 16:38:15 -0800 Subject: [CSV] Number of lines in CSV files In-Reply-To: <20030129002546.493EB3C1F4@coffee.object-craft.com.au> References: <15925.58225.712028.494438@montanaro.dyndns.org> <1043795476.25146.3351.camel@software1.logiplex.internal> <1043799349.25146.3400.camel@software1.logiplex.internal> <20030129002546.493EB3C1F4@coffee.object-craft.com.au> Message-ID: <1043800695.14244.3420.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 16:25, Andrew McNamara wrote: > >So, +1 on punt? > > +1 on punt from me. > > >Actually I think this particular aspect would be fairly simple to > >handle. Another attribute of a dialect could be sizelimits = (maxrows, > >maxcols) and set to (None, None) if the programmer doesn't care or just > >wants to bypass that check. > > Kitchen sink - we'll end up making the dialects too specific for the user > to be able to choose ("do I have Excel2000 with SP2 applied, or..."). > I bet it even varies by region of the world (for example, the Chinese > edition probably has different limits). What do you mean by "kitchen sink"? Are you saying that CSV shouldn't have an option to play tetris while the file is loading? This is going to disappoint a lot of emacs users. Okay, +1 on punting file size. Unless anyone else cares to argue it I suppose we'll leave it out. > I have a sneaking suspicion that Excel's CSV parsing code is reasonably > stable - they're probably not game to make changes now that it mostly > works. We might find that dialect="excel" is good enough. Probably. This can be fixed via bug reports (and dialects added) if that changes. 
-- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From altis at semi-retired.com Wed Jan 29 01:39:34 2003 From: altis at semi-retired.com (Kevin Altis) Date: Tue, 28 Jan 2003 16:39:34 -0800 Subject: [CSV] RE: First Cut at CSV PEP In-Reply-To: Message-ID: > From: Dave Cole > > >>>>> "Kevin" == Kevin Altis writes: > > Kevin> The big issue with the MS/Excel CSV format is that MS doesn't > Kevin> appear to escape any characters or support import of escaped > Kevin> characters. A field that contains characters that you might > Kevin> normally escape (including a comma if that is the separator) > Kevin> are instead enclosed in double quotes by default and then any > Kevin> double quotes in the field are doubled. > > I thought that we were trying to build a CSV parser which would deal > with different dialects, not just what Excel does. Am I wrong making > that assumption? > > If we were to only target Excel our task would be much easier. > > I think that we should be trying to come up with an engine wrapped by > a friendly API which can be made more powerful over time in order to > parse more and more dialects. Agreed, certainly support more than just Excel. I think I understand the dialects thing now. Last night I was getting rubbed the wrong way by specifying the dialect and then also allowing the specification of delimiter, quote character, etc. in the same line. I like the idea of using a dialect and then changing the properties in separate calls. I suppose there is a good reason that each dialect isn't just a subclass, if so, the reasoning for using dialects instead of subclasses of a parser might be called out in the PEP. I can go with it either way. I would be tempted to call what is currently Excel2000, MSCSV or ExcelCSV. 
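The subclass approach Kevin wonders about is essentially what Cliff sketched earlier. Spelled out in full (the attribute values for Excel here are assumptions for illustration, not verified behaviour), it might look like:

```python
# Sketch of dialects as simple attribute-holding classes: each dialect is a
# bag of parser settings, and a variant just subclasses and tweaks one value.
class excel:
    delimiter = ","
    quotechar = '"'
    lineterminator = "\r\n"

class excel_tab(excel):
    # Excel97's "Text (Tab delimited)" save format: CSV rules, tab-separated.
    delimiter = "\t"

# String -> class lookup so a call like writer(f, dialect="excel-tab") can
# resolve the name; exposing the classes lets applications subclass further.
dialects = {
    "excel": excel,
    "excel-tab": excel_tab,
}
```

One attraction of this layout is the point Andrew raised: applications can bypass the string mapping entirely and subclass a dialect class directly when they need a small variation.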
ka From djc at object-craft.com.au Wed Jan 29 01:47:01 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:47:01 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <1043800126.25139.3411.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: >> Instead of limiting the tweakable options by raising an exception >> we could have an interface which allowed the user to query the >> options normally associated with a dialect. >> >> >> Hmm... What would be the best way to handle Excel TSV. Maybe a >> >> new dialect 'excel-tsv'? Cliff> So are we leaning towards dialects being done as simple Cliff> classes? Will 'excel-tsv' simply be defined as Cliff> class excel_tsv(excel_2000): Cliff> delimiter = '\t' Cliff> with a dictionary for lookup: Cliff> settings = { 'excel-tsv': excel_tsv, Cliff> 'excel-2000': excel_2000, Cliff> } Dunno yet. Here we go again with a potentially bad idea... I think that there are two things we need to have for each dialect; a set of low level parser configuration, and a set of user tweakables (which correspond to options presented by the application). The set of user tweakables may not necessarily map one-to-one with low level parser configuration items. How would we do this in Python? >> Should we have a sniffer in the module? Cliff> This hasn't been brought up, but of course one of the major Cliff> selling points of DSV is the "sniffing" code. However, I think Cliff> I'm with Dave on having another layer (CSVutils) that would Cliff> contain this sort of thing. Any sniffer would have to be able to traverse the set of dialects implemented in the CSV module and look inside them to understand which options are available to a dialect. 
- Dave -- http://www.object-craft.com.au From LogiplexSoftware at earthlink.net Wed Jan 29 01:47:20 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 16:47:20 -0800 Subject: [CSV] RE: Number of lines in CSV files In-Reply-To: References: Message-ID: <1043801240.25139.3429.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 16:31, Kevin Altis wrote: > > From: Cliff Wells > > > > Another thing that just occurred to me is that Excel has historically > > been limited in the number of rows and columns that it can import. This > > number has increased with recent versions (I think it was 32K lines in > > Excel 97, Kevin informs me it's 64K in Excel 2000). > > > > Since export will be a feature of the CSV module, should we have some > > sort of warning or raise an exception when exporting data larger than > > the target application can handle, or should we just punt on this? > > +1 on punt > > The user may not actually be trying to import into Excel, they may be using > Access, later versions of Excel might support more rows, whatever. Plus, > Excel still imports the data, it just can't deal with more than 64K rows in > Excel 2000. I guess we need to decide what we mean by "dialect": do we mean "this data _will_ import into this application" or do we mean "this data will be written in a format this application can understand, but might not necessarily be able to use"? 
-- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From djc at object-craft.com.au Wed Jan 29 01:48:46 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:48:46 +1100 Subject: [CSV] RE: Number of lines in CSV files In-Reply-To: <1043801240.25139.3429.camel@software1.logiplex.internal> References: <1043801240.25139.3429.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> On Tue, 2003-01-28 at 16:31, Kevin Altis wrote: >> > From: Cliff Wells >> > >> > Another thing that just occurred to me is that Excel has >> historically > been limited in the number of rows and columns that >> it can import. This > number has increased with recent versions (I >> think it was 32K lines in > Excel 97, Kevin informs me it's 64K in >> Excel 2000). >> > >> > Since export will be a feature of the CSV module, should we have >> some > sort of warning or raise an exception when exporting data >> larger than > the target application can handle, or should we just >> punt on this? >> >> +1 on punt >> >> The user may not actually be trying to import into Excel, they may >> be using Access, later versions of Excel might support more rows, >> whatever. Plus, Excel still imports the data, it just can't deal >> with more than 64K rows in Excel 2000. Cliff> I guess we need to decide what we mean by "dialect": do we mean Cliff> "this data _will_ import into this application" or do we mean Cliff> "this data will be written in a format this application can Cliff> understand, but might not necessarily be able to use"? I vote for the "this data will be written in a format this application can understand, but might not necessarily be able to use". We can always supplement the code with documentation. 
- Dave -- http://www.object-craft.com.au From LogiplexSoftware at earthlink.net Wed Jan 29 02:11:33 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 17:11:33 -0800 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> Message-ID: <1043802693.25139.3445.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 16:47, Dave Cole wrote: > >>>>> "Cliff" == Cliff Wells writes: > > >> Instead of limiting the tweakable options by raising an exception > >> we could have an interface which allowed the user to query the > >> options normally associated with a dialect. > >> > >> >> Hmm... What would be the best way to handle Excel TSV. Maybe a > >> >> new dialect 'excel-tsv'? > > Cliff> So are we leaning towards dialects being done as simple > Cliff> classes? Will 'excel-tsv' simply be defined as > > Cliff> class excel_tsv(excel_2000): > Cliff> delimiter = '\t' > > Cliff> with a dictionary for lookup: > > Cliff> settings = { 'excel-tsv': excel_tsv, > Cliff> 'excel-2000': excel_2000, > Cliff> } > > Dunno yet. > > Here we go again with a potentially bad idea... > > I think that there are two things we need to have for each dialect; a > set of low level parser configuration, and a set of user tweakables > (which correspond to options presented by the application). The set > of user tweakables may not necessarily map one-to-one with low level > parser configuration items. Can you give examples? I suppose you are referring to things like CR/LF translation and spaces around quotes as being low-level parser configurations and things like delimiters being user-tweakable? > > How would we do this in Python? > > >> Should we have a sniffer in the module? 
> > Cliff> This hasn't been brought up, but of course one of the major > Cliff> selling points of DSV is the "sniffing" code. However, I think > Cliff> I'm with Dave on having another layer (CSVutils) that would > Cliff> contain this sort of thing. > > Any sniffer would have to be able to traverse the set of dialects > implemented in the CSV module and look inside them to understand > which options are available to a dialect. Maybe. Currently the sniffing code in DSV just makes a best guess regarding delimiters, text qualifiers and headers. Certainly the dialects could be used to improve its guess (most likely when the sniffed results are ambiguous or fail). Using dialects on import is of less importance if sniffing code is used. They are two different approaches to the same problem. If the user specifies the file as Excel compatible, then sniffing seems rather redundant, further, if the file is sniffed and the format discovered, it doesn't seem important which dialect it matches, as long as we are able to use the sniffed parameters to parse it. 
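[Cliff's "best guess" sniffing, and the per-line statistics Andrew suggests elsewhere in this thread, can be sketched in a few lines of Python. This is an illustrative toy only, not the DSV sniffer; `guess_delimiter` and the candidate set are invented for the example:

```python
from collections import Counter

def guess_delimiter(sample_lines, candidates=",\t;|"):
    """Guess the delimiter whose per-line count is most consistent."""
    best, best_score = None, -1.0
    for delim in candidates:
        counts = [line.count(delim) for line in sample_lines]
        if not counts or min(counts) == 0:
            continue  # candidate missing from some line: unlikely delimiter
        # consistency = fraction of lines sharing the most common count
        score = Counter(counts).most_common(1)[0][1] / len(counts)
        if score > best_score:
            best, best_score = delim, score
    return best

print(guess_delimiter(["a,b,c", "1,2,3", "4,5,6"]))  # ,
```

A real sniffer would also have to weigh quote characters and header rows, as DSV does.]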
-- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From djc at object-craft.com.au Wed Jan 29 02:21:42 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 12:21:42 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <1043802693.25139.3445.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> <1043802693.25139.3445.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> On Tue, 2003-01-28 at 16:47, Dave Cole wrote: >> >>>>> "Cliff" == Cliff Wells >> writes: >> >> >> Instead of limiting the tweakable options by raising an >> exception >> we could have an interface which allowed the user to >> query the >> options normally associated with a dialect. >> >> >> >> >> Hmm... What would be the best way to handle Excel TSV. >> Maybe a >> >> new dialect 'excel-tsv'? >> Cliff> So are we leaning towards dialects being done as simple Cliff> classes? Will 'excel-tsv' simply be defined as >> Cliff> class excel_tsv(excel_2000): delimiter = '\t' >> Cliff> with a dictionary for lookup: >> Cliff> settings = { 'excel-tsv': excel_tsv, 'excel-2000': excel_2000, Cliff> } >> Dunno yet. >> >> Here we go again with a potentially bad idea... >> >> I think that there are two things we need to have for each dialect; >> a set of low level parser configuration, and a set of user >> tweakables (which correspond to options presented by the >> application). The set of user tweakables may not necessarily map >> one-to-one with low level parser configuration items. Cliff> Can you give examples? 
I suppose you are referring to things Cliff> like CR/LF translation and spaces around quotes as being Cliff> low-level parser configurations and things like delimiters Cliff> being user-tweakable? I do not have access to the software at the moment, but not long ago I used a program called TOAD which was a GUI for fiddling around with Oracle as a client. One of the things you could do after executing a query was export the results to a file. I seem to recall that the export dialog had a number of options which do not cleanly map onto just one of the settings we would place in our writer/reader. I will see if I can get a screen shot of the dialog... Cliff> Maybe. Currently the sniffing code in DSV just makes a best Cliff> guess regarding delimiters, text qualifiers and headers. Cliff> Certainly the dialects could be used to improve its guess (most Cliff> likely when the sniffed results are ambiguous or fail). Cliff> Using dialects on import is of less importance if sniffing code Cliff> is used. They are two different approaches to the same Cliff> problem. If the user specifies the file as Excel compatible, Cliff> then sniffing seems rather redundant, further, if the file is Cliff> sniffed and the format discovered, it doesn't seem important Cliff> which dialect it matches, as long as we are able to use the Cliff> sniffed parameters to parse it. The sniffer is definitely your area of expertise. I am just making stuff up as I go :-) - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Wed Jan 29 02:36:01 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 12:36:01 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Dave Cole of "29 Jan 2003 11:47:01 +1100."
References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> Message-ID: <20030129013601.E49083C1F4@coffee.object-craft.com.au> >Here we go again with a potentially bad idea... *-) >I think that there are two things we need to have for each dialect; a >set of low level parser configuration, and a set of user tweakables >(which correspond to options presented by the application). The set >of user tweakables may not necessarily map one-to-one with low level >parser configuration items. This seems to add a fair bit of complexity to the implementation, without simplifying the interface much. In particular, it makes it difficult for the user to move to an alternate dialect (because they'll need to change all the config options). It also makes it harder for third parties to implement their own dialects (or maintain the base ones). And it makes the documentation and tests harder. KISS. >Any sniffer would have to be able to traverse the set of dialects >implemented in the CSV module and look inside them to understand >which options are available to a dialect. It might be enough to look at the first N lines of the file, and do some basic stats (tabs per line, commas per line, etc). Whether it guesses a dialect, or just tries to set individual options is another question. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Wed Jan 29 02:41:00 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 12:41:00 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Cliff Wells of "28 Jan 2003 17:11:33 -0800."
<1043802693.25139.3445.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> <1043802693.25139.3445.camel@software1.logiplex.internal> Message-ID: <20030129014100.D008A3C1F4@coffee.object-craft.com.au> >Using dialects on import is of less importance if sniffing code is >used. They are two different approaches to the same problem. If the >user specifies the file as Excel compatible, then sniffing seems rather >redundant, further, if the file is sniffed and the format discovered, it >doesn't seem important which dialect it matches, as long as we are able >to use the sniffed parameters to parse it. A client of ours has CSV files being sent to him by many different sources - a sniffer would be more valuable to him. I'd like to assume the rules are consistent within any given file, but I'm not sure this is even certain in his application. I think the multiple sources are merged into one file before he gets his hands on them - it's a pathological situation - he has a diabolical pile of python that iteratively attempts to produce something useful. Madness lies this way. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Wed Jan 29 03:01:01 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:01:01 -0600 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030128232803.C6A943C1F4@coffee.object-craft.com.au> References: <1043790321.25139.3251.camel@software1.logiplex.internal> <20030128232803.C6A943C1F4@coffee.object-craft.com.au> Message-ID: <15927.13789.344190.312001@montanaro.dyndns.org> >> Interesting point. I think that newlines inside records are going to >> be the same as those separating records. Anything else would be very >> bizarre. 
Andrew> You should know better than to make a statement like that where Andrew> Microsoft is concerned. Excel uses a single LF within fields, Andrew> but CRLF at the end of lines. If you import a field containing Andrew> CRLF, the CR appears within the field as a box (the "unprintable Andrew> character" symbol). Here's what I can figure out from the samples I saved in Excel today. I'm away from the Windows machine now, so I can only infer the titles in the save menu from the file names, so I may be a bit off in the associations. Still, here goes:

    File Type     delimiter  hard return  line terminator
    CSV           comma      LF           CRLF
    DOS Text      TAB        LF           CRLF
    DOS CSV       comma      LF           CRLF
    Mac Text      TAB        LF           CR
    Mac CSV       comma      LF           CR
    Space         yow, this seems all screwed up!
    TSV           TAB        LF           CRLF
    Unicode CSV   comma      LF           CRLF
    Unicode Text  TAB        LF           CRLF

The Space-separated file looked pretty much like garbage. I'll have to check it out more closely tomorrow. The Unicode CSV file was the same as the DOS CSV and CSV files (same checksum). I was thus fairly surprised to see that the Unicode Text file looked like it had been saved as UTF-16 - each character is followed by an ASCII NUL and there is a little-endian UTF-16 BOM at the start of the file. The table suggests that Excel cares about Windows and Mac line endings, so we should allow that to be a user-specified option. Unfortunately, that means we have to tell people to open files in binary mode, since they will be passing open file objects. Doesn't seem very clean to me. Any ideas?
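[For what it's worth, the csv module as it eventually shipped kept exactly this wart: the caller suppresses Python's newline translation (today via newline='' rather than binary mode) and sets the terminator on the writer. A minimal sketch:

```python
import csv

# newline='' disables Python's newline translation -- the modern
# equivalent of "open the file in binary mode"; lineterminator then
# controls the record ending exactly (CR here, Mac-style).
with open("mac.csv", "w", newline="") as f:
    csv.writer(f, lineterminator="\r").writerow(["a", "b"])

with open("mac.csv", "rb") as f:
    print(f.read())  # b'a,b\r'
```
]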
Skip From djc at object-craft.com.au Wed Jan 29 03:01:27 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 13:01:27 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030129013601.E49083C1F4@coffee.object-craft.com.au> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> <20030129013601.E49083C1F4@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >> Here we go again with a potentially bad idea... Andrew> *-) >> I think that there are two things we need to have for each dialect; >> a set of low level parser configuration, and a set of user >> tweakables (which correspond to options presented by the >> application). The set of user tweakables may not necessarily map >> one-to-one with low level parser configuration items. Andrew> This seems to add a fair bit of complexity to the Andrew> implementation, without simplifying the interface much. In Andrew> particular, it makes it difficult for the user to move to an Andrew> alternate dialect (because they'll need to change all the Andrew> config options). It also makes it harder for third parties to Andrew> implement their own dialects (or maintain the base ones). And Andrew> it makes the documentation and tests harder. KISS. OK. Yes, it was a bad idea which achieved full potential. >> Any sniffer would have to be able to traverse the set of dialects >> implemented in the CSV module and look inside them to understand >> which options are available to a dialect. Andrew> It might be enough to look at the first N lines of the file, Andrew> and do some basic stats (tabs per line, commas per line, Andrew> etc). Whether it guesses a dialect, or just tries to set Andrew> individual options is another question. Just to make your heads hurt a bit more...
In a previous job (at a stock broker) I had to read some CSV data which had been exported by the MS SQL Server BCP program. The excellent BCP program happily exported comma separated data without quoting fields which contained commas. Nasty! I ended up writing some code which post-processed the parsed records based upon the number of fields. The post-processing had high level knowledge of the type of each column so applied heuristics to join fields back together to get the correct field count. I remember that the code knew which columns were text, numeric, dates, times and bit. The code worked from left to right and tried joining text columns with trailing fields then asserted that the remaining fields were consistent with their respective columns. This continued until the field count matched the table column count. All of this was complicated further by the fact that it had to handle archived data and the table definition changed over time... - Dave -- http://www.object-craft.com.au From skip at pobox.com Wed Jan 29 03:06:55 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:06:55 -0600 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> References: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> Message-ID: <15927.14143.403232.338340@montanaro.dyndns.org> Dunno who said this now, but I disagree with this statement: >> I suppose that exporting should raise an exception if you specify any >> variation on the dialect in the writer function. In the proto-PEP I tried to address this issue: When processing a dialect setting and one or more of the other optional parameters, the dialect parameter is processed first, then the others are processed. This makes it easy to choose a dialect, then override one or more of the settings. 
For example, if a CSV file was generated by Excel 2000 using single quotes as the quote character and TAB as the delimiter, you could create a reader like:: csvreader = csv.reader(file("some.csv"), dialect="excel2000", quotechar="'", delimiter='\t') Other details of how Excel generates CSV files would be handled automatically. I think we should try our damndest to not raise exceptions. The example is just to show that we will allow people to start from a known state and tweak it. "This file has all the properties of an Excel 2000 file except an apostrophe was used as the quote character and a TAB was used as the delimiter." Skip From skip at pobox.com Wed Jan 29 03:21:20 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:21:20 -0600 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> Message-ID: <15927.15008.879314.896465@montanaro.dyndns.org> Dave> The point of the 'raw' dialect is to expose the full capabilities Dave> of the raw parser. Maybe we should use None rather than 'raw'. Nah, "raw" won't mean anything to anyone. Make "excel2000" the default. The point of the dialect names is that they should mean something to someone. That generally means application names, not something like "raw". I think it also means you only have variants associated with applications which normally provide few choices. We can probably all come close to specifying what the parameter settings are for "excel2000", but what about "gnumeric"? As I write this I'm looking at a Gnumeric "Save As" wizard. The user can choose line termination (LF is the default), delimiter (comma is the default), quoting style (automatic (default), always, never), and the quote character (" is the default). Even though the wizard presents sensible defaults, I'm less enthusiastic about creating a "gnumeric" variant, precisely because it won't necessarily mean much.
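[As it turned out, this is the interface the stdlib csv module shipped with -- a dialect name plus keyword overrides -- though the built-in dialect ended up being called "excel" rather than "excel2000". A sketch against the modern module:

```python
import csv
import io

# Excel-style data, but tab-delimited and single-quoted.
data = io.StringIO("'a'\t'b,c'\t'd'\r\n")

# Start from a known dialect, then override individual settings,
# just as the proto-PEP example does.
reader = csv.reader(data, dialect="excel", quotechar="'", delimiter="\t")
print(next(reader))  # ['a', 'b,c', 'd']
```
]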
Cliff> I think it is an option to save as a TSV file (IIRC), which is Cliff> the same as a CSV file, but with tabs. Dave> Hmm... What would be the best way to handle Excel TSV. Maybe a Dave> new dialect 'excel-tsv'? Any of: reader = csv.reader(file("some.csv"), variant="excel2000-tsv") or reader = csv.reader(file("some.csv"), variant="excel2000", delimiter='\t') or (assuming "excel2000" is the default), just: reader = csv.reader(file("some.csv"), delimiter='\t') Dave> I am not saying that the wrapper should absolutely prevent someone Dave> from using options not available in the application. If you want to Dave> break the dialect then maybe it should be a two step process. Dave> csvwriter = csv.writer(file("newnastiness.csv", "w"), Dave> dialect='excel2000') Dave> csvwriter.setparams(delimiter='"') That seems cumbersome. I think we have to give our users both some credit (for brains) and some flexibility. It seems gratuitous (and unPythonic) to specify some parameters in the constructor and some in a later method. All this dialect stuff will be handled at the Python level, right? Skip From skip at pobox.com Wed Jan 29 03:30:23 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:30:23 -0600 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> Message-ID: <15927.15551.93504.635849@montanaro.dyndns.org> Skip> I'm not so sure this mapping None to "None" on output is such a Skip> good idea because it's not reversible in all situations and hurts Skip> portability to other systems (e.g., does Excel have a concept of Skip> None? what happens if you have a text field which just happens to Skip> contain "None"?). Dave> I think that None should always be written as a zero length field, Dave> and always read as the field value 'None' I'm really skeptical of this. There is just no equivalence between None and ''. 
Right now using the Object Craft csv module, a blank field comes through as an empty string. I think that's the correct behavior. Skip> I think we need to limit the data which can be output to strings, Skip> Unicode strings (if we use an encoded stream), floats and ints. Skip> Anything else should raise TypeError. Dave> Is there any merit having the writer handling non-string data by Dave> producing an empty field for None, and the result of Dave> PyObject_Str() for all other values? I'm not sure. I'm inclined to not allow anything other than what I said above. Certainly, compound objects should raise exceptions. I think of CSV more like XML-RPC than Pyro. We're trying to exchange data with as many other languages and applications as possible, not create a new protocol for exchanging data with other Python programs. CSV is designed to represent the numeric and string values in spreadsheets and databases. Going too far beyond that seems like out-of-scope to me, especially if this is to get into 2.3. Remember, 2.3a1 is already out there! Dave> That raises another implementation issue. If you export from Dave> Excel, does it always quote fields? If not then the default Dave> dialect behaviour should not unconditionally quote fields. Not in my limited experience. It quotes only where necessary (fields containing delimiters or starting with the quote character). Dave> We could/should support mandatoryquote as a writer option. This is something Laurence Tratt's original CSV module did (his ASV module probably does as well). I used it all the time. Gnumeric provides "always", "as needed" and "never". I don't know how you'd do "never" without specifying an escape character. I just tried "never" while saving CSV data from Gnumeric. It didn't escape embedded commas, so it effectively toasted the data. 
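[Gnumeric's three choices ("always", "as needed", "never") map directly onto the quoting constants the csv module eventually grew. Notably, the shipped module answers Skip's question about "never": without an escape character it raises an error rather than silently toasting the data:

```python
import csv
import io

row = ["a,b", "plain"]

for quoting in (csv.QUOTE_ALL, csv.QUOTE_MINIMAL):
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting).writerow(row)
    print(repr(buf.getvalue()))
# '"a,b","plain"\r\n'  -- always
# '"a,b",plain\r\n'    -- as needed

# "never" with no escape character fails instead of corrupting data
try:
    csv.writer(io.StringIO(), quoting=csv.QUOTE_NONE).writerow(row)
except csv.Error as exc:
    print(exc)  # need to escape, but no escapechar set

# with an escapechar, "never" works
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar="\\").writerow(row)
print(repr(buf.getvalue()))
```
]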
Skip From skip at pobox.com Wed Jan 29 03:36:04 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:36:04 -0600 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <1043798625.25139.3395.camel@software1.logiplex.internal> References: <1043798625.25139.3395.camel@software1.logiplex.internal> Message-ID: <15927.15892.377982.750393@montanaro.dyndns.org> Cliff> Frankly, I think I lean towards an exception on this one. There Cliff> are enough text-processing tools available (dos2unix and kin) Cliff> that someone should be able to pre-process a CSV file that is Cliff> raising exceptions and get it into a form acceptable to the Cliff> parser. A little work up front is far more acceptable than Cliff> putting out a fire on someone's database. How would you handle this example? You saved a file in Excel which contained "hard returns". Line termination is thus CRLF and hard returns are LF. Bring it over to your Unix system, run dos2unix on it, read it into Python, fiddle with it and write it out. Now run unix2dos and push it back to the Windows machine for viewing with Excel. Guess what just happened to those "hard returns"? :-( Like you said, this may indeed be a very hard, or intractable problem. I propose we not spend any more time on it now, but add it as an issue and get some feedback from the broader community when an initial version of the PEP is released (which I'd like to do in the next couple of days). Skip From csv-request at manatee.mojam.com Wed Jan 29 03:39:31 2003 From: csv-request at manatee.mojam.com (csv-request at manatee.mojam.com) Date: Tue, 28 Jan 2003 20:39:31 -0600 Subject: Welcome to the "Csv" mailing list Message-ID: <200301290239.h0T2dVPL007061@manatee.mojam.com> Welcome to the Csv at manatee.mojam.com mailing list!
To post to this list, send your email to: csv at manatee.mojam.com General information about the mailing list is at: http://manatee.mojam.com/mailman/listinfo/csv If you ever want to unsubscribe or change your options (eg, switch to or from digest mode, change your password, etc.), visit your subscription page at: http://manatee.mojam.com/mailman/options/csv/andrewm%40object-craft.com.au You can also make such adjustments via email by sending a message to: Csv-request at manatee.mojam.com with the word `help' in the subject or body (don't include the quotes), and you will get back a message with instructions. You must know your password to change your options (including changing the password, itself) or to unsubscribe. It is: uhzuug If you forget your password, don't worry, you will receive a monthly reminder telling you what all your manatee.mojam.com mailing list passwords are, and how to unsubscribe or change your options. There is also a button on your options page that will email your current password to you. You may also have your password mailed to you automatically off of the Web page noted above. From skip at pobox.com Wed Jan 29 03:45:45 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:45:45 -0600 Subject: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043794249.14244.3330.camel@software1.logiplex.internal> <15927.3700.803751.757376@montanaro.dyndns.org> Message-ID: <15927.16473.998484.348688@montanaro.dyndns.org> Skip> Nope. I could set up a Mailman list on the Mojam server if you Skip> don't think that's too much overkill. Dave> Do it. We can then use URL's to old messages. You got it. We've all been subscribed and you should each have received a welcome message by now. I will make sure list messages are archived, and once the PEP is published, use that as the response address for comments. All five of us have been subscribed. The posting address is csv at mail.mojam.com. 
I'll run spambayes in front of Mailman so I can leave open posting enabled yet not drown in a sea of spam (which will almost certainly begin shortly after the address is published). If you use procmail or other mail filtering tools, you can key on this header: X-Spambayes-Classification: ham; 0.00 where "ham" (good mail) may be replaced by "spam" or "unsure". The number will range from 0.00 ("certain ham") to 1.00 ("certain spam"). Skip From skip at pobox.com Wed Jan 29 03:49:53 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:49:53 -0600 Subject: [Csv] test message Message-ID: <15927.16721.284748.270083@montanaro.dyndns.org> Just a test - did I screw anything up? Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 03:55:28 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:55:28 -0600 Subject: [Csv] 'Nuther test Message-ID: <15927.17056.525801.505496@montanaro.dyndns.org> test 2 _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 04:20:44 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 14:20:44 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <15927.13789.344190.312001@montanaro.dyndns.org> References: <1043790321.25139.3251.camel@software1.logiplex.internal> <20030128232803.C6A943C1F4@coffee.object-craft.com.au> <15927.13789.344190.312001@montanaro.dyndns.org> Message-ID: skip> The table suggests that Excel cares about Windows and Mac line skip> endings, so we should allow that to be a user-specified option. skip> Unfortunately, that means we have to tell people to open files skip> in binary mode, since they will be passing open file objects. skip> Doesn't seem very clean to me. Any ideas? 
Failing to open a file in binary mode is already a gotcha in Python. If someone wants to force a particular end of line in the writer then they must be prepared to open the file in binary mode. - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 04:25:16 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 14:25:16 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <15927.14143.403232.338340@montanaro.dyndns.org> References: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> <15927.14143.403232.338340@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> Dunno who said this now, but I disagree with this statement: >>> I suppose that exporting should raise an exception if you specify >>> any variation on the dialect in the writer function. That was me. I now agree that it is a bad idea. Andrew suggested that we apply the KISS principle. I agree with his suggestion that a dialect just defines a collection of settings in the parser. You are then free to redefine any or all of those settings as additional keyword arguments to the csv.reader() or csv.writer() functions. Skip> In the proto-PEP I tried to address this issue: Skip> When processing a dialect setting and one or more of the Skip> other optional parameters, the dialect parameter is processed Skip> first, then the others are processed. This makes it easy to Skip> choose a dialect, then override one or more of the settings. Skip> For example, if a CSV file was generated by Excel 2000 using Skip> single quotes as the quote character and TAB as the delimiter, Skip> you could create a reader like:: Skip> csvreader = csv.reader(file("some.csv"), Skip> dialect="excel2000", quotechar="'", Skip> delimiter='\t') I think that we are now in violent agreement. A good thing.
- Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 04:25:49 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 21:25:49 -0600 Subject: [Csv] List up and running - mostly Message-ID: <15927.18877.657911.91142@montanaro.dyndns.org> Posting to the list seems to be working okay. Nothing seems to be archived though. I'll try and get that resolved by midday tomorrow. I'm kinda pooped though and need to knock off for the evening. I will ask David Goodger, the PEP editor, for a number for the PEP so we can check it in and share the writing. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 04:27:37 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 21:27:37 -0600 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: References: <1043790321.25139.3251.camel@software1.logiplex.internal> <20030128232803.C6A943C1F4@coffee.object-craft.com.au> <15927.13789.344190.312001@montanaro.dyndns.org> Message-ID: <15927.18985.513079.628267@montanaro.dyndns.org> Dave> Failing to open a file in binary mode is already a gotcha in Dave> Python. If someone wants to force a particular end of line in the Dave> writer then they must be prepared to open the file in binary mode. Then I guess we just document the wart. ;-) Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 04:30:49 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 21:30:49 -0600 Subject: [Csv] CVS checkin privileges Message-ID: <15927.19177.471122.352771@montanaro.dyndns.org> Dave, Andrew & Cliff have been added as developers to the Python project. At Kevin's request he wasn't added. 
G'night... Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 04:34:03 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 21:34:03 -0600 Subject: [Csv] More mailing lists ;-) Message-ID: <15927.19371.648045.184281@montanaro.dyndns.org> Barry Warsaw suggested you also subscribe to python-checkins. I'm less certain you'll find that interesting, but it's the only way you'll see the checkins others make to our little sandbox. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 04:36:50 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 14:36:50 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <15927.15008.879314.896465@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <15927.15008.879314.896465@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Dave> The point of the 'raw' dialect is to expose the full Dave> capabilities of the raw parser. Maybe we should use None rather Dave> than 'raw'. Skip> Nah, "raw" won't mean anything to anyone. Make "excel2000" the Skip> default. The point of the dialect names is that they should Skip> mean something to someone. That generally means application Skip> names, not something lile "raw". I think it also means you only Skip> have variants associated with applications which normally Skip> provide few choices. We can probably all come close to Skip> specifying what the parameter settings are for "excel2000", but Skip> what about "gnumeric"? As I write this I'm looking at a Skip> Gnumeric "Save As" wizard. 
The user can choose line termination Skip> (LF is the default), delimiter (comma is the default), quoting Skip> style (automatic (default), always, never), and the quote Skip> character (" is the default). Even though the wizard presents Skip> sensible defaults, I'm less enthusiastic about creating a Skip> "gnumeric" variant, precisely because it won't necessarily mean Skip> much. Before we get too excited about setting dialect names in stone, we might want to start on the torture test. It seems logical (to me) that the first step in cataloguing dialects is to define the classification tool. We may find that many applications are faithful clones of 'excel' (rather than 'excel2000', 'excel97', 'excel.net'). Cliff> I think it is an option to save as a TSV file (IIRC), which is Cliff> the same as a CSV file, but with tabs. Dave> Hmm... What would be the best way to handle Excel TSV? Maybe a Dave> new dialect 'excel-tsv'? Skip> Any of: Skip> reader = csv.reader(file("some.csv"), Skip> variant="excel2000-tsv") Are you suggesting that each dialect have a collection of variants? This would mean you would have two layers of settings (is this a good thing?). The variant could just be a way of layering a set of options over the options defined by a dialect. I can see Andrew telling us to KISS. Dave> I am not saying that the wrapper should absolutely prevent Dave> someone from using options not available in the application. If Dave> you want to break the dialect then maybe it should be a two-step Dave> process. Dave> csvwriter = csv.writer(file("newnastiness.csv", "w"), Dave> dialect='excel2000') Dave> csvwriter.setparams(delimiter='"') Skip> That seems cumbersome. I think we have to give our users both Skip> some credit (for brains) and some flexibility. It seems Skip> gratuitous (and unPythonic) to specify some parameters in the Skip> constructor and some in a later method. I have been convinced now that this is a bad idea.
Skip> All this dialect stuff will be handled at the Python level, Skip> right? Yes. In my mind all that the extension module would be is an engine with a set of configurable items. No knowledge of dialects (or variants) would be in the C code. - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 04:45:10 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 14:45:10 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <15927.15551.93504.635849@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> <15927.15551.93504.635849@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> I'm not so sure this mapping None to "None" on output is such a Skip> good idea because it's not reversible in all situations and Skip> hurts portability to other systems (e.g., does Excel have a Skip> concept of None? what happens if you have a text field which Skip> just happens to contain "None"?). Dave> I think that None should always be written as a zero length Dave> field, and always read as the field value 'None'. Skip> I'm really skeptical of this. There is just no equivalence Skip> between None and ''. Right now using the Object Craft csv Skip> module, a blank field comes through as an empty string. I think Skip> that's the correct behavior. I think I was unnecessarily clumsy in my explanation. This is what I was trying to say: >>> w = csv.writer(sys.stdout) >>> w.write(['','hello',None]) ',hello,\n' >>> r = csv.reader(StringIO('None,hello,')) >>> for l in r: print l ['None','hello',''] Skip> I think we need to limit the data which can be output to Skip> strings, Unicode strings (if we use an encoded stream), floats Skip> and ints. Anything else should raise TypeError.
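[Dave's round-trip can be checked against the csv module that eventually shipped in the Python standard library. A quick sketch with today's API — not the 2003 prototypes under discussion here:]

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['', 'hello', None])   # None goes out as an empty field
assert buf.getvalue() == ',hello,\r\n'

row = next(csv.reader(io.StringIO('None,hello,')))
assert row == ['None', 'hello', '']             # empty field reads back as '', not None
```

As Skip notes, the mapping is not reversible: a written None and a written '' are indistinguishable on the way back in.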
Dave> Is there any merit in having the writer handle non-string data by Dave> producing an empty field for None, and the result of Dave> PyObject_Str() for all other values? Skip> I'm not sure. I'm inclined to not allow anything other than Skip> what I said above. Certainly, compound objects should raise Skip> exceptions. I think of CSV more like XML-RPC than Pyro. We're Skip> trying to exchange data with as many other languages and Skip> applications as possible, not create a new protocol for Skip> exchanging data with other Python programs. CSV is designed to Skip> represent the numeric and string values in spreadsheets and Skip> databases. Going too far beyond that seems out of scope to Skip> me, especially if this is to get into 2.3. Remember, 2.3a1 is Skip> already out there! OK. The current version of the CSV module does what I was suggesting. We will just have to remove that code. Dave> That raises another implementation issue. If you export from Dave> Excel, does it always quote fields? If not then the default Dave> dialect behaviour should not unconditionally quote fields. Skip> Not in my limited experience. It quotes only where necessary Skip> (fields containing delimiters or starting with the quote Skip> character). Dave> We could/should support mandatoryquote as a writer option. Skip> This is something Laurence Tratt's original CSV module did (his Skip> ASV module probably does as well). I used it all the time. Skip> Gnumeric provides "always", "as needed" and "never". I don't Skip> know how you'd do "never" without specifying an escape Skip> character. I just tried "never" while saving CSV data from Skip> Gnumeric. It didn't escape embedded commas, so it effectively Skip> toasted the data. I have seen that happen in other applications.
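[Gnumeric's "always" and "as needed" correspond to what became the QUOTE_ALL and QUOTE_MINIMAL constants in the stdlib csv module. A sketch with today's API:]

```python
import csv
import io

def dump(quoting):
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting).writerow(['plain', 'has,comma'])
    return buf.getvalue()

assert dump(csv.QUOTE_MINIMAL) == 'plain,"has,comma"\r\n'    # "as needed"
assert dump(csv.QUOTE_ALL) == '"plain","has,comma"\r\n'      # "always" (mandatory quote)
```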
- Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From andrewm at object-craft.com.au Wed Jan 29 07:05:01 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 17:05:01 +1100 Subject: [Csv] CSV interface question Message-ID: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> In the proposed PEP, we have separate instances for reading and writing. In the Object Craft csv module, a single instance is shared by the parse and join methods - the only virtue of this is config is shared (so the same options are used to write the file as were used to read the file). Maybe we should consider a "container of options" class (of which the dialects would be subclasses). The sniffing code could then return an instance of this class (which wouldn't necessarily be a dialect). With this, you might do things like: options = csv.sniffer(open("foobar.csv")) for fields in csv.reader(open("foobar.csv"), options) ... do stuff csvwriter = csv.writer(open("newfoovar.csv", "w"), options) try: for fields in whatever: csvwriter.write(fields) finally: csvwriter.close() The idea being you'd then re-write the file with the same sniffed options. Another idea occurs - looping over an iterable is going to be common - we could probably supply a convenience function, say "writelines(iterable)"? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From andrewm at object-craft.com.au Wed Jan 29 11:16:58 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 21:16:58 +1100 Subject: [Csv] CSV interface question In-Reply-To: Message from Andrew McNamara of "Wed, 29 Jan 2003 17:05:01 +1100."
<20030129060501.DB9193C1F4@coffee.object-craft.com.au> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> Message-ID: <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> >Maybe we should consider a "container of options" class (of which the >dialects would be subclasses). The sniffing code could then return an >instance of this class (which wouldn't necessarily be a dialect). With >this, you might do things like: Another thought - rather than specify the dialect name as a string, it could be specified as a class or instance - something like: csv.reader(fileobj, csv.dialect.excel) Thoughts? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 11:50:30 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 21:50:30 +1100 Subject: [Csv] CSV interface question In-Reply-To: <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >> Maybe we should consider a "container of options" class (of which >> the dialects would be subclasses). The sniffing code could then >> return an instance of this class (which wouldn't necessarily be a >> dialect). With this, you might do things like: Andrew> Another thought - rather than specify the dialect name as a Andrew> string, it could be specified as a class or instance - Andrew> something like: Andrew> csv.reader(fileobj, csv.dialect.excel) Andrew> Thoughts? Is there a downside to this? I can't see one immediately. 
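[For the record, this is how the stdlib module ended up working: a dialect can be passed either as a registered name or as a Dialect class, interchangeably. A sketch with today's API:]

```python
import csv
import io

by_name = list(csv.reader(io.StringIO('a,b,"c,d"'), dialect='excel'))
by_class = list(csv.reader(io.StringIO('a,b,"c,d"'), dialect=csv.excel))
assert by_name == by_class == [['a', 'b', 'c,d']]
```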
- Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 11:53:11 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 21:53:11 +1100 Subject: [Csv] Getting some files in place Message-ID: I am currently converting the CSV module to something which at least looks like it is native Python C code. I will commit to the sandbox soon. This is a chance to bring my Python guts knowledge up to date. Probably going to take a few goes though. - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 12:55:08 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 22:55:08 +1100 Subject: [Csv] My version of the PEP Message-ID: I had all sorts of grand plans for the PEP during the day which involved dialects and validation of options used on dialects. I was also going to write it up tonight. In retrospect there is very little of what I was proposing which I still think is worthwhile. Andrew has sent me a small Python module which almost completely implements the current PEP - I have asked him to commit it to the sandbox. If you look at the sandbox now you will notice that I have committed a reformatted version of our csv parser. We are fairly close to having something concrete to play with. - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 14:33:04 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 07:33:04 -0600 Subject: [Csv] PEP checked in Message-ID: <15927.55312.962767.436646@montanaro.dyndns.org> I asked David Goodger for a number for the CSV PEP. 
He checked it in as PEP 305. You can edit it via cvs from the .../python/nondist/peps directory. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 15:10:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 08:10:44 -0600 Subject: [Csv] some PEP reorg Message-ID: <15927.57572.299432.893613@montanaro.dyndns.org> I reorganized the parameter descriptions and added set_dialect and get_dialect functions. The job is incomplete, but I have to get to work. Feel free to flesh things out more. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From andrewm at object-craft.com.au Wed Jan 29 15:20:13 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 30 Jan 2003 01:20:13 +1100 Subject: [Csv] My version of the PEP In-Reply-To: Message from Dave Cole of "29 Jan 2003 22:55:08 +1100." References: Message-ID: <20030129142013.B80303C1F4@coffee.object-craft.com.au> >Andrew has sent me a small Python module which almost completely >implements the current PEP - I have asked him to commit it to the sandbox. Okay - I've commited it. It's pretty crude, and contains no docstrings yet. Time for bed. 
-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 16:56:56 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 09:56:56 -0600 Subject: [CSV] Number of lines in CSV files In-Reply-To: <1043800695.14244.3420.camel@software1.logiplex.internal> References: <15925.58225.712028.494438@montanaro.dyndns.org> <1043795476.25146.3351.camel@software1.logiplex.internal> <1043799349.25146.3400.camel@software1.logiplex.internal> <20030129002546.493EB3C1F4@coffee.object-craft.com.au> <1043800695.14244.3420.camel@software1.logiplex.internal> Message-ID: <15927.63944.112239.481587@montanaro.dyndns.org> Cliff> Okay, +1 on punting file size. Unless anyone else cares to argue Cliff> it I suppose we'll leave it out. I don't know how you could support it if a csv reader is an iterable. You wouldn't know until you encountered a row with more than max columns or read the row which exceeded the max rows. Similarly, just because I want my CSV file to be formatted the same way Excel does things doesn't mean I am going to load the file into Excel. >> I have a sneaking suspicion that Excel's CSV parsing code is reasonably >> stable - they're probably not game to make changes now that it mostly >> works. We might find that dialect="excel" is good enough. Cliff> Probably. This can be fixed via bug reports (and dialects added) Cliff> if that changes. "excel" it is. I believe that should be fine for Excel 97 and Excel 2000 (ISTR that Excel 2000 is just Excel 97 bundled in Office 2000). Any distinctions with older versions can be tagged, e.g., "excel95", "excel4", though I suspect they may also be the same.
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 16:58:45 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 09:58:45 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: References: Message-ID: <15927.64053.523071.753657@montanaro.dyndns.org> Kevin> I suppose there is a good reason that each dialect isn't just a Kevin> subclass; if so, the reasoning for using dialects instead of Kevin> subclasses of a parser might be called out in the PEP. I can go Kevin> with it either way. Overkill, I think. The engine never changes. All we are doing is making it easy to set a bunch of parameters. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 17:02:05 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 10:02:05 -0600 Subject: [Csv] RE: Number of lines in CSV files In-Reply-To: <1043801240.25139.3429.camel@software1.logiplex.internal> References: <1043801240.25139.3429.camel@software1.logiplex.internal> Message-ID: <15927.64253.371647.216288@montanaro.dyndns.org> Cliff> I guess we need to decide what we mean by "dialect": do we mean Cliff> "this data _will_ import into this application" or do we mean Cliff> "this data will be written in a format this application can Cliff> understand, but might not necessarily be able to use"? When I proposed "variant" and later "dialect" I was only referring to the format of the file. I wasn't concerned directly with whether a specific application would be able to process it. For example, it appears that Gnumeric can import Excel-generated CSV files just fine. Accordingly, if I know I'm going to read the file into Gnumeric, I might just as well specify "excel" as the dialect for the writer.
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 17:11:34 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 10:11:34 -0600 Subject: [Csv] Coding dialects In-Reply-To: <1043800126.25139.3411.camel@software1.logiplex.internal> Message-ID: <15927.64822.167086.284052@montanaro.dyndns.org> (Changing the subject to suit the topic a bit better...) Cliff> So are we leaning towards dialects being done as simple classes? Cliff> Will 'excel-tsv' simply be defined as Cliff> class excel_tsv(excel_2000): Cliff> delimiter = '\t' Cliff> with a dictionary for lookup: Cliff> settings = { 'excel-tsv': excel_tsv, Cliff> 'excel-2000': excel_2000, } Cliff> ? I was thinking of dialects as dicts. You'd have excel_dialect = { "quotechar": '"', "delimiter": ',', "lineterminator": '\r\n', ... } with a corresponding mapping as you suggested: settings = { 'excel': excel_dialect, 'excel-tsv': excel_tabs_dialect, } then in the factory functions do something like: def reader(fileobj, dialect="excel", **kwds): kwargs = copy.copy(settings[dialect]) kwargs.update(kwds) # possible sanity check on kwargs here ... return _csv.reader(fileobj, **kwargs) Perhaps we could distribute a dialects.csv file ;-) with the module which defines the supported dialects. That file would be loaded upon initial import to define the various dialect dicts. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 17:16:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 10:16:44 -0600 Subject: [Csv] Sniffing dialects In-Reply-To: <1043802693.25139.3445.camel@software1.logiplex.internal> Message-ID: <15927.65132.432457.594501@montanaro.dyndns.org> If my notion of dialects as dicts isn't too far off-base, the sniffing code could just return a dict.
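[Skip's dict-based factory runs as-is against the stdlib csv module. A self-contained sketch — the `settings` table and `reader` wrapper follow his example and are not a real API:]

```python
import csv

# Dialects as plain dicts of format parameters, looked up by name.
excel_dialect = {'quotechar': '"', 'delimiter': ',', 'lineterminator': '\r\n'}
excel_tabs_dialect = dict(excel_dialect, delimiter='\t')
settings = {'excel': excel_dialect, 'excel-tsv': excel_tabs_dialect}

def reader(fileobj, dialect='excel', **kwds):
    kwargs = dict(settings[dialect])   # copy, so callers can't mutate the table
    kwargs.update(kwds)                # explicit keyword args override the dialect
    return csv.reader(fileobj, **kwargs)

assert next(reader(['x\ty\tz'], dialect='excel-tsv')) == ['x', 'y', 'z']
```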
That would be a good way to define new dialects. Someone could send us a CSV file from a particular application. We'd turn the sniffer loose on it then append the result to our dialects.csv file. (A different version of) the sniffer could take an optional dialect string as an arg and either use it as the starting point (for stuff it can't discern, like hard returns in CSV files which don't contain any) or tell you if the input file is compatible with that dialect. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 17:21:42 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 10:21:42 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <15927.15008.879314.896465@montanaro.dyndns.org> Message-ID: <15927.65430.961029.406378@montanaro.dyndns.org> Skip> Any of: Skip> reader = csv.reader(file("some.csv"), Skip> variant="excel2000-tsv") Dave> Are you suggesting that each dialect have a collection of Dave> variants? Nope. "variant" was a mistake there. Should have been "dialect". Dialect names are just strings which map to either classes or dicts. 
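[The sniff-then-reuse idea in this thread is, more or less, what csv.Sniffer does in the module that shipped: sniff() returns a Dialect subclass you can hand straight back to reader() or writer(). A sketch with today's API:]

```python
import csv
import io

sample = 'a;b;c\r\n1;2;3\r\n'
dialect = csv.Sniffer().sniff(sample)    # returns a Dialect subclass
assert dialect.delimiter == ';'

# Re-read (or re-write) the data with the sniffed settings.
rows = list(csv.reader(io.StringIO(sample), dialect))
assert rows == [['a', 'b', 'c'], ['1', '2', '3']]
```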
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 17:27:59 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 10:27:59 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> <15927.15551.93504.635849@montanaro.dyndns.org> Message-ID: <15928.271.283784.851985@montanaro.dyndns.org> Dave> This is what I was trying to say: >>> w = csv.writer(sys.stdout) >>> w.write(['','hello',None]) ',hello,\n' >>> r = csv.reader(StringIO('None,hello,')) >>> for l in r: print l ['None','hello',''] Skip> I think we need to limit the data which can be output to strings, Skip> Unicode strings (if we use an encoded stream), floats and ints. Skip> Anything else should raise TypeError. Dave> Is there any merit in having the writer handle non-string data by Dave> producing an empty field for None, and the result of Dave> PyObject_Str() for all other values? We could do what some of the DB API modules do and provide mappings which take the types of objects and see if a function exists to handle that type. If so, whatever that function returns would be what was written. This could handle the case of None (allowing the user to specify how it was mapped), but could also be used to massage data of known type (for example, to round all floats to two decimal places). I think this sort of capability should wait until the second generation though. Skip> I just tried "never" while saving CSV data from Gnumeric. It Skip> didn't escape embedded commas, so it effectively toasted the data. Dave> I have seen that happen in other applications. Needless to say, our csv module should *not* do that. Fried data, when accompanied by angry mobs, doesn't taste too good.
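[The DB-API-style mapping Skip floats above can be sketched as a per-type serializer table consulted before each value reaches the writer. The `serializers` table and `writerow` wrapper are hypothetical, not part of any csv module:]

```python
import csv
import io

# Hypothetical per-type serializer table; unknown types fall back to str().
serializers = {
    type(None): lambda v: '',          # user decides how None is mapped
    float: lambda v: f'{v:.2f}',       # e.g. round all floats to two places
}

def writerow(writer, row):
    writer.writerow([serializers.get(type(v), str)(v) for v in row])

buf = io.StringIO()
writerow(csv.writer(buf), [None, 3.14159, 'text', 7])
assert buf.getvalue() == ',3.14,text,7\r\n'
```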
If the user specifies "never", I think an exception should be raised if no escape character is defined and fields containing the delimiter are encountered. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 17:54:55 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 10:54:55 -0600 Subject: [Csv] Re: python/nondist/peps pep-0305.txt,1.2,1.3 In-Reply-To: References: Message-ID: <15928.1887.707385.352015@montanaro.dyndns.org> >> Changed Type to Standards Track. David> I believe this PEP is Informational, not Standards Track. Yes, but it's also the working document for the csv module currently gestating in the sandbox, and which we hope to get into Python 2.3. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From LogiplexSoftware at earthlink.net Wed Jan 29 17:58:37 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 29 Jan 2003 08:58:37 -0800 Subject: [Csv] Devil in the details, including the small one between delimiters and quotechars Message-ID: <1043859517.16012.14.camel@software1.logiplex.internal> Okay, despite claims to the contrary, Pure Evil can in fact be broken down into little bits and stored in ASCII files. This spaces around quoted data bit is starting to bother me. Consider the following: 1, "not quoted","quoted" It seems reasonable to parse this as: ['1', ' "not quoted"', 'quoted'] which is the described Excel behavior. Now consider 1,"not quoted" ,"quoted" Is the second field quoted or not? If it is, do we discard the extraneous whitespace following it or raise an exception? Worse, consider this "quoted", "not quoted, but this ""field"" has delimiters and quotes" How should this parse? I say free exceptions for everyone.
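[Skip's "never"-quoting proposal is what the shipped writer ended up doing: with QUOTE_NONE and no escape character, a field containing the delimiter raises csv.Error instead of silently toasting the data. A sketch with today's API:]

```python
import csv
import io

w = csv.writer(io.StringIO(), quoting=csv.QUOTE_NONE)
try:
    w.writerow(['fine', 'has,comma'])   # delimiter in data, nowhere to hide it
except csv.Error:
    pass                                # "need to escape, but no escapechar set"
else:
    raise AssertionError('expected csv.Error')
```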
While we're on the topic, I heard back from my DSV user who had mentioned this corner case of spaces between delimiters and quotes and he admitted that the files were created by hand, by him (figures); he seems to recall some now-forgotten application that may have done this but wasn't sure. His memory was vague on whether he saw it on a PC or in a barn eating hay. I propose that a space between delimiters and quotes raise an exception and let's be done with it. I don't think this really affects Excel compatibility since Excel will never generate this type of file and doesn't require it for import. It's true that some files that Excel would import (probably incorrectly) won't import in CSV, but I think that's outside the scope of Excel compatibility. Anyway, I know no one has said "On your mark, get set" yet, but I can't think without code sitting in front of me, breaking worse with every keystroke, so in addition to creating some test cases, I've hacked up a very preliminary CSV module so we have something to play with. I was up until 6am so if there's anything odd, I blame it on lack of sleep and the feverish optimism and glossing of detail that comes with it. Note that while the entire test.csv gets imported without exception, the last few lines aren't parsed correctly. At least, I don't think they are. I can't remember now. Also, this code is based upon what was discussed up until yesterday when I went home, so recent conversations may not be reflected. Mercilessly dissect away. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 -------------- next part -------------- A non-text attachment was scrubbed... Name: CSV.py Type: text/x-python Size: 5570 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030129/d80b9ba0/attachment.py -------------- next part -------------- A non-text attachment was scrubbed...
Name: test.csv Type: text/x-comma-separated-values Size: 720 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030129/d80b9ba0/attachment.bin From skip at pobox.com Wed Jan 29 18:17:53 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 11:17:53 -0600 Subject: [Csv] CSV interface question In-Reply-To: References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> Message-ID: <15928.3265.630020.528438@montanaro.dyndns.org> Andrew> csv.reader(fileobj, csv.dialect.excel) Andrew> Thoughts? Dave> Is there a downside to this? I can't see one immediately. With the dialect concept all we are talking about is a collection of parameter settings. Encapsulating that as subclasses seems like it hides the data-oriented nature behind the facade of source code. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From LogiplexSoftware at earthlink.net Wed Jan 29 18:31:02 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 29 Jan 2003 09:31:02 -0800 Subject: [Csv] CSV interface question In-Reply-To: References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> Message-ID: <1043861462.16012.46.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 02:50, Dave Cole wrote: > >>>>> "Andrew" == Andrew McNamara writes: > > >> Maybe we should consider a "container of options" class (of which > >> the dialects would be subclasses). The sniffing code could then > >> return an instance of this class (which wouldn't necessarily be a > >> dialect). With this, you might do things like: > > Andrew> Another thought - rather than specify the dialect name as a > Andrew> string, it could be specified as a class or instance - > Andrew> something like: > > Andrew> csv.reader(fileobj, csv.dialect.excel) > > Andrew> Thoughts? 
> > Is there a downside to this? I can't see one immediately. Actually, there is a downside to using strings, as you will see if you look at the code I posted a little while ago. By taking dialect as a string, it basically precludes the user rolling their own dialect except as keyword arguments. After working on this, I'm inclined to have the programmer pass a class or other structure. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 18:31:31 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 11:31:31 -0600 Subject: [Csv] Devil in the details, including the small one between delimiters and quotechars In-Reply-To: <1043859517.16012.14.camel@software1.logiplex.internal> References: <1043859517.16012.14.camel@software1.logiplex.internal> Message-ID: <15928.4083.834299.369381@montanaro.dyndns.org> Cliff> Now consider Cliff> 1,"not quoted" ,"quoted" Cliff> Is the second field quoted or not? If it is, do we discard the Cliff> extraneous whitespace following it or raise an exception? Well, there's always the "be flexible in what you accept, strict in what you generate" school of thought. In the above, that would suggest the list returned would be ['1', 'not quoted', 'quoted'] It seems like a minor formatting glitch. How about a warning? Or a "strict" flag for the parser? Cliff> Worse, consider this Cliff> "quoted", "not quoted, but this ""field"" has delimiters and quotes" Depends on the setting of skipinitialspaces. If false, you get ['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] if True, I think you get ['quoted', 'not quoted, but this "field" has delimiters and quotes'] Cliff> How should this parse? I say free exceptions for everyone.
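[The flag discussed here shipped as `skipinitialspace`, and it decides Cliff's first case exactly as Skip describes: whether a quote following ", " still opens a quoted field. A sketch with today's API:]

```python
import csv

line = ['1, "not quoted","quoted"']
assert next(csv.reader(line)) == ['1', ' "not quoted"', 'quoted']
assert next(csv.reader(line, skipinitialspace=True)) == ['1', 'not quoted', 'quoted']
```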
Cliff> While we're on the topic, I heard back from my DSV user who had Cliff> mentioned this corner case of spaces between delimiters and Cliff> quotes and he admitted that the files were created by hand, by Cliff> him (figures), he seems to recall some now forgotten application Cliff> that may have done this but wasn't sure. His memory was vague on Cliff> whether he saw it on a PC or in a barn eating hay. Don't you just love customers with concrete requirements? ;-) Cliff> I propose space between delimiters and quotes raise an exception Cliff> and let's be done with it. I don't think this really affects Cliff> Excel compatibility since Excel will never generate this type of Cliff> file and doesn't require it for import. It's true that some Cliff> files that Excel would import (probably incorrectly) won't import Cliff> in CSV, but I think that's outside the scope of Excel Cliff> compatibility. Sounds good to me. Cliff> Anyway, I know no one has said "On your mark, get set" yet, but I Cliff> can't think without code sitting in front of me, breaking worse Cliff> with every keystroke, so in addition to creating some test cases, Cliff> I've hacked up a very preliminary CSV module so we have something Cliff> to play with. I was up til 6am so if there's anything odd, I Cliff> blame it on lack of sleep and the feverish optimism and glossing Cliff> of detail that comes with it. Perhaps you and Dave were in a race but didn't know it? 
;-) Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 18:41:07 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 11:41:07 -0600 Subject: [Csv] CSV interface question In-Reply-To: <1043861462.16012.46.camel@software1.logiplex.internal> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> Message-ID: <15928.4659.449989.410123@montanaro.dyndns.org> Cliff> Actually, there is a downside to using strings, as you will see Cliff> if you look at the code I posted a little while ago. By taking Cliff> dialect as a string, it basically precludes the user rolling Cliff> their own dialect except as keyword arguments. After working on Cliff> this, I'm inclined to have the programmer pass a class or other Cliff> structure. Don't forget we have the speedy Object Craft _csv engine sitting underneath the covers. Under the assumption that all the actual processing goes on at that level, I see no particular reason dialect info needs to be anything other than a collection of keyword arguments. I view csv.reader and csv.writer as factory functions which return functional readers and writers defined in _csv.c. The Python level serves simply to paper over the low-level extension module. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 18:46:01 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 11:46:01 -0600 Subject: [CSV] Number of lines in CSV files Message-ID: <15928.4953.211377.214912@montanaro.dyndns.org> An oldish message which got snared by my laptop's mobility... 
From: Skip Montanaro To: Cliff Wells Cc: Kevin Altis , csv at object-craft.com.au Subject: Re: [CSV] Number of lines in CSV files Date: Tue, 28 Jan 2003 18:48:56 -0600 Reply-To: skip at pobox.com Cliff> Since export will be a feature of the CSV module, should we have Cliff> some sort of warning or raise an exception when exporting data Cliff> larger than the target application can handle, or should we just Cliff> punt on this? Punt. At most I would put it in a separate csvutils module such as Dave suggested. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From LogiplexSoftware at earthlink.net Wed Jan 29 19:08:24 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 29 Jan 2003 10:08:24 -0800 Subject: [Csv] CSV interface question In-Reply-To: <15928.4659.449989.410123@montanaro.dyndns.org> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> Message-ID: <1043863704.16012.64.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 09:41, Skip Montanaro wrote: > Cliff> Actually, there is a downside to using strings, as you will see > Cliff> if you look at the code I posted a little while ago. By taking > Cliff> dialect as a string, it basically precludes the user rolling > Cliff> their own dialect except as keyword arguments. After working on > Cliff> this, I'm inclined to have the programmer pass a class or other > Cliff> structure. > > Don't forget we have the speedy Object Craft _csv engine sitting underneath > the covers. Under the assumption that all the actual processing goes on at > that level, I see no particular reason dialect info needs to be anything > other than a collection of keyword arguments. You've lost me, I'm afraid. 
What I'm saying is that: csvreader = reader(file("test_data/sfsample.csv", 'r'), dialect='excel') isn't as flexible as csvreader = reader(file("test_data/sfsample.csv", 'r'), dialect=excel) where excel is either a pre-defined dictionary/class or a user-created dictionary/class. As an aside, I prefer using a class as it allows for validating the dialect settings from the dialect object itself (see the CSV.py I posted earlier). > I view csv.reader and > csv.writer as factory functions which return functional readers and writers > defined in _csv.c. The Python level serves simply to paper over the > low-level extension module. That's what I see also (even though the CSV.py I posted earlier doesn't exactly follow that convention). I do think we need a pure Python alternative to the C module, but both of them should be exposed via a higher-level interface. Unfortunately, I'm still not mentally linking this with my earlier point =) -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 19:17:39 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 12:17:39 -0600 Subject: [Csv] CSV interface question In-Reply-To: <1043863704.16012.64.camel@software1.logiplex.internal> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> Message-ID: <15928.6851.934680.995625@montanaro.dyndns.org> Cliff> You've lost me, I'm afraid. 
What I'm saying is that: Cliff> csvreader = reader(file("test_data/sfsample.csv", 'r'), Cliff> dialect='excel') Cliff> isn't as flexible as Cliff> csvreader = reader(file("test_data/sfsample.csv", 'r'), Cliff> dialect=excel) Cliff> where excel is either a pre-defined dictionary/class or a Cliff> user-created dictionary/class. Yes, but my string just indexes into a mapping to get to the real dict which stores the parameter settings, as I indicated in an earlier post: I was thinking of dialects as dicts. You'd have excel_dialect = { "quotechar": '"', "delimiter": ',', "linetermintor": '\r\n', ... } with a corresponding mapping as you suggested: settings = { 'excel': excel_dialect, 'excel-tsv': excel_tabs_dialect, } then in the factory functions do something like: def reader(fileobj, dialect="excel", **kwds): kwargs = copy.copy(settings[dialect]) kwargs.update(kwds) # possible sanity check on kwargs here ... return _csv.reader(fileobj, **kwargs) Did that not make it out? I also think it's cleaner if we have a data file which is loaded at import time to define the various dialects. That way we aren't mixing too much data into our code. It also opens up the opportunity for users to later specify their own dialect data files. Where I indicated "possible sanity check" above would be a call to a validation function on the settings.
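[Editor's note: Skip's factory sketch above can be made runnable. This is a hedged sketch — the `resolve_dialect` helper name is illustrative, the option set is trimmed, and the real `_csv` engine is stubbed out by simply returning the merged options.]

```python
import copy

# Dialect settings as plain dicts; the keys follow the thread, but the
# set shown here is trimmed for illustration.
excel_dialect = {"quotechar": '"', "delimiter": ",", "lineterminator": "\r\n"}
excel_tsv_dialect = dict(excel_dialect, delimiter="\t")

settings = {"excel": excel_dialect, "excel-tsv": excel_tsv_dialect}

def resolve_dialect(dialect="excel", **kwds):
    """Look up a named dialect and apply per-call keyword overrides,
    as the reader/writer factory functions would before calling _csv."""
    kwargs = copy.copy(settings[dialect])
    kwargs.update(kwds)  # explicit keyword args win over dialect defaults
    return kwargs
```

A per-call override then behaves as the factory intends: `resolve_dialect("excel-tsv", quotechar="'")` yields the tab dialect with a single-quote quotechar while leaving the registered dict untouched.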
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From LogiplexSoftware at earthlink.net Wed Jan 29 20:18:16 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 29 Jan 2003 11:18:16 -0800 Subject: [Csv] CSV interface question In-Reply-To: <15928.6851.934680.995625@montanaro.dyndns.org> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> Message-ID: <1043867895.16012.87.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 10:17, Skip Montanaro wrote: > Cliff> You've lost me, I'm afraid. What I'm saying is that: > > Cliff> csvreader = reader(file("test_data/sfsample.csv", 'r'), > Cliff> dialect='excel') > > Cliff> isn't as flexible as > > Cliff> csvreader = reader(file("test_data/sfsample.csv", 'r'), > Cliff> dialect=excel) > > Cliff> where excel is either a pre-defined dictionary/class or a > Cliff> user-created dictionary/class. > > Yes, but my string just indexes into a mapping to get to the real dict which > stores the parameter settings, as I indicated in an earlier post: > > I was thinking of dialects as dicts. You'd have > excel_dialect = { "quotechar": '"', > "delimiter": ',', > "linetermintor": '\r\n', > ... > } > > with a corresponding mapping as you suggested: > > settings = { 'excel': excel_dialect, > 'excel-tsv: excel_tabs_dialect, } > > then in the factory functions do something like: > > def reader(fileobj, dialect="excel", **kwds): > kwargs = copy.copy(settings[dialect]) > kwargs.update(kwds) > # possible sanity check on kwargs here ... 
> return _csv.reader(fileobj, **kwargs) I understand this, but I think you miss my point (or I missed you with it ;) Consider now the programmer actually defining a new dialect: Passing a class or other structure (a dict is fine), they can create this on the fly with minimal work. Using a *string*, they must first "register" that string somewhere (probably in the mapping we agree upon) before they can actually make the function call. Granted, it's only an extra step, but it requires a bit more knowledge (of the mapping) and doesn't seem to provide a real benefit. If you prefer a mapping to a class, that is fine, but let's pass the mapping rather than a string referring to it: excel_dialect = { "quotechar": '"', "delimiter": ',', "linetermintor": '\r\n', ... } settings = { 'excel': excel, 'excel-tsv': excel_tabs, } def reader(fileobj, dialect=excel, **kwds): kwargs = copy.copy(dialect) kwargs.update(kwds) # possible sanity check on kwargs here ... return _csv.reader(fileobj, **kwargs) This allows the user to do such things as: mydialect = { ... } reader(fileobj, mydialect, ...) rather than mydialect = { ... } settings['mydialect'] = mydialect reader(fileobj, 'mydialect', ...) To use the settings table for getting a default, they can still use reader(fileobj, settings['excel-tsv'], ...) or just use the excel settings directly: reader(fileobj, excel_tsv, ...) (BTW, I prefer 'dialects' to 'settings' for the mapping name, just for consistency). I'll grant that the difference is small, but it still requires one extra line and one extra piece of knowledge with no real benefit to the programmer, AFAICT. If you don't agree I'll let it pass as it *is* a relatively minor difference. > Did that not make it out? I also think it's cleaner if we have a data file > which is loaded at import time to define the various dialects. That way we > aren't mixing too much data into our code. It also opens up the opportunity > for users to later specify their own dialect data files.
Where I indicated > "possible sanity check" above would be a call to a validation function on > the settings. +1 on this, but only if you cave on the other one -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Thu Jan 30 00:15:23 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 10:15:23 +1100 Subject: [Csv] CSV interface question In-Reply-To: <15928.6851.934680.995625@montanaro.dyndns.org> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> Message-ID: Cliff> You've lost me, I'm afraid. What I'm saying is that: Cliff> csvreader = reader(file("test_data/sfsample.csv", 'r'), Cliff> dialect='excel') Cliff> isn't as flexible as Cliff> csvreader = reader(file("test_data/sfsample.csv", 'r'), Cliff> dialect=excel) Cliff> where excel is either a pre-defined dictionary/class or a Cliff> user-created dictionary/class. Skip> Yes, but my string just indexes into a mapping to get to the Skip> real dict which stores the parameter settings, as I indicated in Skip> an earlier post: Skip> Skip> I was thinking of dialects as dicts. You'd have Skip> Skip> excel_dialect = { "quotechar": '"', Skip> "delimiter": ',', Skip> "linetermintor": '\r\n', Skip> ... Skip> } Note the spelling error in "linetermintor" - user constructed dictionaries are not good. Whenever I find myself using dictionaries for storing values as opposed to indexing data I can't escape the feeling that my past as a Perl programmer is coming back to haunt me. 
At least with Perl there is some syntactic sugar to make this type of thing less ugly: excel_dialect = { quotechar => '"', delimiter => ',', linetermintor => '\r\n' } In the absence of that sugar I would prefer something like the following: class excel: quotechar = '"' delimiter = ',' linetermintor = '\r\n' settings = {} for dialect in (excel, exceltsv): settings[dialect.__name__] = dialect Maybe we could include a name attribute which allowed us to use 'excel-tsv' as a dialect identifier. Skip> with a corresponding mapping as you suggested: Skip> Skip> settings = { 'excel': excel_dialect, Skip> 'excel-tsv: excel_tabs_dialect, } Skip> Skip> then in the factory functions do something like: Skip> Skip> def reader(fileobj, dialect="excel", **kwds): Skip> kwargs = copy.copy(settings[dialect]) Skip> kwargs.update(kwds) Skip> # possible sanity check on kwargs here ... Skip> return _csv.reader(fileobj, **kwargs) With the class technique this would become: def reader(fileobj, dialect=excel, **kwds): kwargs = {} for key, value in dialect.__dict__.iteritems(): if not key.startswith('_'): kwargs[key] = value kwargs.update(kwds) return _csv.reader(fileobj, **kwargs) Skip> Did that not make it out? I also think it's cleaner if we have Skip> a data file which is loaded at import time to define the various Skip> dialects. That way we aren't mixing too much data into our Skip> code. It also opens up the opportunity for users to later Skip> specify their own dialect data files. Where I indicated Skip> "possible sanity check" above would be a call to a validation Skip> function on the settings. Hmmm... Hard and messy to define classes on the fly. Then we are back to some kind of dialect object. 
class dialect: def __init__(self, quotechar='"', delimiter=',', lineterminator='\r\n'): self.quotechar = quotechar self.delimiter = delimiter self.lineterminator = lineterminator settings = { 'excel': dialect(), 'excel-tsv': dialect(delimiter='\t') } def add_dialect(name, dialect): settings[name] = dialect def reader(fileobj, args='excel', **kwds): kwargs = {} if not isinstance(args, dialect): args = settings[args] kwargs.update(args.__dict__) kwargs.update(kwds) return _csv.reader(fileobj, **kwargs) This would then allow you to extend the settings dictionary on the fly, or simply pass your own dialect object. >>> import csv >>> my_dialect = csv.dialect(lineterminator = '\f') >>> rdr = csv.reader(file('blah.csv'), my_dialect) - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Thu Jan 30 00:16:57 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 10:16:57 +1100 Subject: [Csv] CSV interface question In-Reply-To: <1043867895.16012.87.camel@software1.logiplex.internal> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> <1043867895.16012.87.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: >> Did that not make it out? I also think it's cleaner if we have a >> data file which is loaded at import time to define the various >> dialects. That way we aren't mixing too much data into our code. >> It also opens up the opportunity for users to later specify their >> own dialect data files. Where I indicated "possible sanity check" >> above would be a call to a validation function on the settings.
Cliff> +1 on this, but only if you cave on the other one LOL. Have you considered a career as a politician? - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Thu Jan 30 00:19:32 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 10:19:32 +1100 Subject: [Csv] Sniffing dialects In-Reply-To: <15927.65132.432457.594501@montanaro.dyndns.org> References: <15927.65132.432457.594501@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> If my notion of dialects as dicts isn't too far off-base, the Skip> sniffing code could just return a dict. That would be a good Skip> way to define new dialects. Someone could send us a CSV file Skip> from a particular application. We'd turn the sniffer loose on Skip> it then append the result to our dialects.csv file. I am all for dialects as attribute only objects. You get the same effect as a dict but with less Perlish syntax. Skip> (A different version of) the sniffer could take an optional Skip> dialect string as an arg and either use it as the starting point Skip> (for stuff it can't discern, like hard returns in CSV files Skip> which don't contain any) or tell you if the input file is Skip> compatible with that dialect. 
+1 - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Thu Jan 30 00:25:23 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 10:25:23 +1100 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <15928.271.283784.851985@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> <15927.15551.93504.635849@montanaro.dyndns.org> <15928.271.283784.851985@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Dave> This is what I was trying to say: >>>> w = csv.writer(sys.stdio) w.write(['','hello',None]) Skip> ',hello,\n' >>>> r = csv.reader(StringIO('None,hello,')) for l in csv: print r Skip> ['None','hello',''] Skip> I think we need to limit the data which can be output to Skip> strings, Unicode strings (if we use an encoded stream), floats Skip> and ints. Anything else should raise TypeError. Dave> Is there any merit having the writer handling non-string data by Dave> producing an empty field for None, and the result of Dave> PyObject_Str() for all other values? Skip> We could do like some of the DB API modules do and provide Skip> mappings which take the types of objects and see if a function Skip> exists to handle that type. If so, whatever that function Skip> returns would be what was written. This could handle the case Skip> of None (allowing the user to specify how it was mapped), but Skip> could also be used to massage data of known type (for example, Skip> to round all floats to two decimal places). Skip> I think this sort of capability should wait until the second Skip> generation though. I think this would make things too slow. The Python core already has a convenience function for doing the necessary conversion; PyObject_Str(). 
If we are in a hurry we could document the existing low level writer behaviour which is to invoke PyObject_Str() for all non-string values except None. None is translated to ''. Skip> I just tried "never" while saving CSV data from Gnumeric. It Skip> didn't escape embedded commas, so it effectively toasted the Skip> data. Dave> I have seen that happen in other applications. Skip> Needless to say, our csv module should *not* do that. Fried Skip> data, when accompanied by angry mobs, doesn't taste too good. Skip> If the user specifies "never", I think an exception should be Skip> raised if no escape character is defined and fields containing Skip> the delimiter are encountered. Should the _csv parser should sanity check the combination of options in the constructor, or when told to write data which is broken? It is possible to define no quote or escape character but still write valid data. 1,2,3,4 - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From andrewm at object-craft.com.au Thu Jan 30 00:43:45 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 30 Jan 2003 10:43:45 +1100 Subject: [Csv] CSV interface question In-Reply-To: Message from Skip Montanaro <15928.3265.630020.528438@montanaro.dyndns.org> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <15928.3265.630020.528438@montanaro.dyndns.org> Message-ID: <20030129234345.3CE6D3C32B@coffee.object-craft.com.au> > Andrew> csv.reader(fileobj, csv.dialect.excel) > > Andrew> Thoughts? > > Dave> Is there a downside to this? I can't see one immediately. > >With the dialect concept all we are talking about is a collection of >parameter settings. Encapsulating that as subclasses seems like it hides >the data-oriented nature behind the facade of source code. 
It has the virtue that sub-classing can be used to represent related variants. So, excel-tab might be: class excel-tab(excel): delimiter = '\t' This could also be useful for users of the module: class funky(excel): quotes = "'" Essentially we'd be using classes as glorified dictionaries with cascading values. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From andrewm at object-craft.com.au Thu Jan 30 00:45:26 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 30 Jan 2003 10:45:26 +1100 Subject: [Csv] CSV interface question In-Reply-To: Message from Dave Cole References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> Message-ID: <20030129234526.2B1943C32B@coffee.object-craft.com.au> >With the class technique this would become: > >def reader(fileobj, dialect=excel, **kwds): > kwargs = {} > for key, value in dialect.__dict__.iteritems(): > if not key.startswith('_'): > kwargs[key] = value > kwargs.update(kwds) > return _csv.reader(fileobj, **kwargs) BTW, your method of extracting directly from the instance's __dict__ doesn't pick up class attributes. In my prototype implementation, I used getattr instead. 
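[Editor's note: Andrew's point about `__dict__` versus `getattr` can be shown in a few lines. A hedged sketch with illustrative class names:]

```python
# Dialects as classes with cascading values via inheritance.
class excel:
    delimiter = ","
    quotechar = '"'

class excel_tab(excel):
    delimiter = "\t"

d = excel_tab()
# Class attributes never appear in an instance's __dict__, so iterating
# it directly yields no options at all:
empty = d.__dict__

# ... but dir() plus getattr walks the class and all its bases:
opts = {name: getattr(d, name) for name in dir(d) if not name.startswith("_")}
```

Here `opts` ends up with both the inherited `quotechar` and the overriding `delimiter`, which is the behavior the keyword-argument extraction needs.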
-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Thu Jan 30 00:49:44 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 10:49:44 +1100 Subject: [Csv] CSV interface question In-Reply-To: <20030129234345.3CE6D3C32B@coffee.object-craft.com.au> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <15928.3265.630020.528438@montanaro.dyndns.org> <20030129234345.3CE6D3C32B@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: Andrew> csv.reader(fileobj, csv.dialect.excel) >> Andrew> Thoughts? >> Dave> Is there a downside to this? I can't see one immediately. >> With the dialect concept all we are talking about is a collection >> of parameter settings. Encapsulating that as subclasses seems like >> it hides the data-oriented nature behind the facade of source code. Andrew> It has the virtue that sub-classing can be used to represent Andrew> related variants. So, excel-tab might be: Andrew> class excel-tab(excel): Andrew> delimiter = '\t' Not sure the python interpreter will like that class name :-) Andrew> This could also be useful for users of the module: Andrew> class funky(excel): Andrew> quotes = "'" Andrew> Essentially we'd be using classes as glorified dictionaries Andrew> with cascading values. I would prefer attribute only flat objects. The alternative would have us traversing inheritance trees to extract class dictionaries. class dialect: def __init__(self, delimiter=',', ...): self.delimiter = delimiter : >>> funky = csv.copy_dialect('excel') >>> funky.quotes = "'" Not as nice as subclassing, but probably good enough. 
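[Editor's note: Dave's `copy_dialect` idea for flat attribute-only objects might look like this. A sketch only — the registry name and trimmed option set are assumptions, not settled API.]

```python
import copy

class dialect:
    def __init__(self, delimiter=",", quotechar='"', lineterminator="\r\n"):
        self.delimiter = delimiter
        self.quotechar = quotechar
        self.lineterminator = lineterminator

settings = {"excel": dialect(), "excel-tsv": dialect(delimiter="\t")}

def copy_dialect(name):
    """Return an independent copy of a registered dialect, safe to mutate."""
    return copy.copy(settings[name])

funky = copy_dialect("excel")
funky.quotechar = "'"  # tweak the copy; the registered original is untouched
```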
- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Thu Jan 30 00:57:57 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 10:57:57 +1100 Subject: [Csv] CSV interface question In-Reply-To: <20030129234526.2B1943C32B@coffee.object-craft.com.au> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> <20030129234526.2B1943C32B@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >> With the class technique this would become: >> >> def reader(fileobj, dialect=excel, **kwds): >> kwargs = {} >> for key, value in dialect.__dict__.iteritems(): >> if not key.startswith('_'): >> kwargs[key] = value >> kwargs.update(kwds) >> return _csv.reader(fileobj, **kwargs) Andrew> BTW, your method of extracting directly from the instance's Andrew> __dict__ doesn't pick up class attributes. In my prototype Andrew> implementation, I used getattr instead. Ahhh... So does this mean that we can go back to classes? 
class dialect: quotechar = '"' delimiter = ',' lineterminator = '\r\n' dialect_opts = [attr for attr in dir(dialect) if not attr.startswith('_')] excel = dialect class excel_tsv(excel): delimiter = '\t' def reader(fileobj, dialectobj=excel, **kwds): kwargs = {} for opt in dialect_opts: kwargs[opt] = getattr(dialectobj, opt) kwargs.update(kwds) return _csv.reader(fileobj, **kwargs) -- http://www.object-craft.com.au From skip at pobox.com Thu Jan 30 02:47:20 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 19:47:20 -0600 Subject: [Csv] CSV interface question In-Reply-To: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> Message-ID: <15928.33832.843555.867347@montanaro.dyndns.org> Andrew> In the proposed PEP, we have separate instances for reading and Andrew> writing. In the Object Craft csv module, a single instance is Andrew> shared by the parse and join methods - the only virtue of this Andrew> is config is shared (so the same options are used to write the Andrew> file as were used to read the file). ... Andrew> The idea being you'd then re-write the file with the same Andrew> sniffed options. In my work, I rarely read and write the same file. I either read a file, then shoot it to a database or go the other way. In situations where the input and output are both CSV files, at least one is stdout, and there is almost always something different about the reading and writing parameters. Andrew> Another idea occurs - looping over an iteratable is going to be Andrew> common - we could probably supply a convenience function, say Andrew> "writelines(iteratable)"? Seems reasonable. 
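[Editor's note: the `writelines(iterable)` convenience Andrew floats could wrap the row-at-a-time writer like this. A sketch — the join logic stands in for the real `_csv` engine, and the class name is illustrative.]

```python
import io

class Writer:
    """Toy writer: one write() per row, plus a writelines() convenience."""
    def __init__(self, fileobj, delimiter=",", lineterminator="\r\n"):
        self.fileobj = fileobj
        self.delimiter = delimiter
        self.lineterminator = lineterminator

    def write(self, row):
        # Stand-in for the real quoting/escaping engine.
        self.fileobj.write(self.delimiter.join(map(str, row)) + self.lineterminator)

    def writelines(self, iterable):
        # Loop over any iterable of rows - the common bulk-export case.
        for row in iterable:
            self.write(row)

buf = io.StringIO()
Writer(buf).writelines([["a", 1], ["b", 2]])
```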
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 02:57:30 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 19:57:30 -0600 Subject: [Csv] CSV interface question In-Reply-To: <1043867895.16012.87.camel@software1.logiplex.internal> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> <1043867895.16012.87.camel@software1.logiplex.internal> Message-ID: <15928.34442.337899.905054@montanaro.dyndns.org> Cliff> Consider now the programmer actually defining a new dialect: Cliff> Passing a class or other structure (a dict is fine), they can Cliff> create this on the fly with minimal work. Using a *string*, they Cliff> must first "register" that string somewhere (probably in the Cliff> mapping we agree upon) before they can actually make the function Cliff> call. Granted, it's only a an extra step, but it requires a bit Cliff> more knowledge (of the mapping) and doesn't seem to provide a Cliff> real benefit. If you prefer a mapping to a class, that is fine, Cliff> but lets pass the mapping rather than a string referring to it: Somewhere I think we still need to associate string names with these beasts. Maybe it's just another attribute: class dialect: name = None class excel(dialect): name = "excel" ... They should all be collected together for operation as a group. This could be so a GUI knows all the names to present or so a sniffer can return all the dialects with which a sample file is compatible. Both operations suggest the need to register dialects somehow. 
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 03:07:37 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 20:07:37 -0600 Subject: [Csv] CSV interface question In-Reply-To: References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> Message-ID: <15928.35049.893899.848768@montanaro.dyndns.org> Skip> I was thinking of dialects as dicts.... Dave> Note the spelling error in "linetermintor" - user constructed Dave> dictionaries are not good. Yeah, but Cliff's dialect validator would have caught that. ;-) Dave> Whenever I find myself using dictionaries for storing values as Dave> opposed to indexing data I can't escape the feeling that my past Dave> as a Perl programmer is coming back to haunt me. At least with Dave> Perl there is some syntactic sugar to make this type of thing less Dave> ugly: Dave> excel_dialect = { quotechar => '"', Dave> delimiter => ',', Dave> linetermintor => '\r\n' } Other than losing a couple quote marks and substituting => for : I don't see how the Perl syntax is any better. Note also that with dicts you can simply pass them as keyword args: return _csv.reader(..., **kwdargs) You'll have to do a little more work with classes to make that work (a subclass's __dict__ attribute does not include the parent class's __dict__ contents) and with the possibility of new-style classes you will have to work even harder. Dave> Maybe we could include a name attribute which allowed us to use Dave> 'excel-tsv' as a dialect identifier. As I mentioned in my last post, I think name attributes will be necessary, at least for human consumption. 
Dave> def reader(fileobj, dialect=excel, **kwds): Dave> kwargs = {} Dave> for key, value in dialect.__dict__.iteritems(): Dave> if not key.startswith('_'): Dave> kwargs[key] = value Dave> kwargs.update(kwds) Dave> return _csv.reader(fileobj, **kwargs) Not quite. You need to traverse the bases to pick up everything. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 03:10:58 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 20:10:58 -0600 Subject: [Csv] Sniffing dialects In-Reply-To: References: <15927.65132.432457.594501@montanaro.dyndns.org> Message-ID: <15928.35250.269725.510622@montanaro.dyndns.org> Dave> I am all for dialects as attribute only objects. You get the same Dave> effect as a dict but with less Perlish syntax. I'll cave on this one, but I still think dicts are the better solution, especially if dialects might be read from data files. There's also the issue of mapping dialects as classes onto keyword argument dicts. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From andrewm at object-craft.com.au Thu Jan 30 03:21:13 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 30 Jan 2003 13:21:13 +1100 Subject: [Csv] Sniffing dialects In-Reply-To: Message from Skip Montanaro <15928.35250.269725.510622@montanaro.dyndns.org> References: <15927.65132.432457.594501@montanaro.dyndns.org> <15928.35250.269725.510622@montanaro.dyndns.org> Message-ID: <20030130022113.7E9E93C32B@coffee.object-craft.com.au> > Dave> I am all for dialects as attribute only objects. You get the same > Dave> effect as a dict but with less Perlish syntax. > >I'll cave on this one, but I still think dicts are the better solution, >especially if dialects might be read from data files. 
There's also the >issue of mapping dialects as classes onto keyword argument dicts. Have you had a look at the code I checked in as csv.py in the sandbox? Aside from the inheritance, I prefer dicts. But the inheritance feels like a valuable addition. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 03:21:51 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 20:21:51 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> <15927.15551.93504.635849@montanaro.dyndns.org> <15928.271.283784.851985@montanaro.dyndns.org> Message-ID: <15928.35903.293437.273039@montanaro.dyndns.org> Skip> We could do like some of the DB API modules do and provide Skip> mappings which take the types of objects and see if a function Skip> exists to handle that type. Dave> I think this would make things too slow. Not in the typical case. The typical case would be the null converter case. Dave> The Python core already has a convenience function for doing the Dave> necessary conversion; PyObject_Str(). This smacks of implicit type conversions to me, which has been the bane of my interaction with Perl (via XML-RPC). I still think we have no business writing anything but strings, Unicode strings (encoded by codecs.open()), ints and floats to CSV files. Exceptions should be raised for anything else, even None. An empty field is "". Dave> If we are in a hurry we could document the existing low level Dave> writer behaviour which is to invoke PyObject_Str() for all Dave> non-string values except None. None is translated to ''. I really still dislike this whole None thing. Whose use case is that anyway? Skip> Needless to say, our csv module should *not* do that. 
Fried data, Skip> when accompanied by angry mobs, doesn't taste too good. If the Skip> user specifies "never", I think an exception should be raised if Skip> no escape character is defined and fields containing the delimiter Skip> are encountered. Dave> Should the _csv parser should sanity check the combination of Dave> options in the constructor, or when told to write data which is Dave> broken? I think only when a row is written which would create an ambiguous row. Upon reading you have no real choice. If there's an unescaped embedded delimiter in an unquoted field, how is the reader object to know the user doesn't want multiple fields? Dave> It is possible to define no quote or escape character but still Dave> write valid data. Dave> 1,2,3,4 Yup, and it should work okay, only barfing when there is an actual ambiguity. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 03:23:10 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 20:23:10 -0600 Subject: [Csv] CSV interface question In-Reply-To: References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <15928.3265.630020.528438@montanaro.dyndns.org> <20030129234345.3CE6D3C32B@coffee.object-craft.com.au> Message-ID: <15928.35982.667260.27999@montanaro.dyndns.org> Dave> I would prefer attribute only flat objects. Sounds like a dictionary to me. 
;-) Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 03:29:00 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 20:29:00 -0600 Subject: [Csv] Sniffing dialects In-Reply-To: <20030130022113.7E9E93C32B@coffee.object-craft.com.au> References: <15927.65132.432457.594501@montanaro.dyndns.org> <15928.35250.269725.510622@montanaro.dyndns.org> <20030130022113.7E9E93C32B@coffee.object-craft.com.au> Message-ID: <15928.36332.883241.640958@montanaro.dyndns.org> >> I'll cave on this one, but I still think dicts are the better >> solution, especially if dialects might be read from data files. >> There's also the issue of mapping dialects as classes onto keyword >> argument dicts. Andrew> Have you had a look at the code I checked in as csv.py in the Andrew> sandbox? Not since midday, and I wasn't looking closely. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 03:48:59 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 20:48:59 -0600 Subject: [Csv] Status Message-ID: <15928.37531.445243.692589@montanaro.dyndns.org> It would appear we are converging on dialects as data-only classes (subclassable but with no methods). I'll update the PEP. Many other ideas have been floating through the list, and while I haven't been deleting the messages, I haven't been adding them to the PEP either. Can someone help with that? I'd like to get the wording in the PEP to converge on our current thoughts and announce it on c.l.py and python-dev sometime tomorrow. I think we will get a lot of feedback from both camps, hopefully some of it useful. ;-) Sound like a plan? 
I just finished making a pass through the messages I hadn't deleted (and
then saved them to a csv mbox file since the list appears to still not be
archiving). Here's what I think we've concluded:

 * Dialects are a set of defaults, probably implemented as classes (which
   allows subclassing, whereas dicts wouldn't) and the default dialect
   named as something like csv.dialects.excel or "excel" if we allow
   string specifiers. (I think strings work well at the API, simply
   because they are shorter and can more easily be presented in GUI
   tools.)

 * A csvutils module should be at least scoped out which might do a fair
   number of things:
   - Implements one or more sniffers for parameter types
   - Validates CSV files (e.g., constant number of columns, type
     constraints on column values, compares against given dialect)
   - Generate a sniffer from a CSV file

 * These individual parameters are necessary (hopefully the names will be
   enough clue as to their meaning): quote_char, quoting ("auto",
   "always", "nonnumeric", "never"), delimiter, line_terminator,
   skip_whitespace, escape_char, hard_return. Are there others?

 * We're still undecided about None (I certainly don't think it's a valid
   value to be writing to CSV files)

 * Rows can have variable numbers of columns and the application is
   responsible for deciding on and enforcing max_rows or max_cols.

 * Don't raise exceptions needlessly. For example, specifying
   quoting="never" and not specifying a value for escape_char would be
   okay until you encounter a field when writing which contains the
   delimiter.

 * Files have to be opened in binary mode (we can check the mode
   attribute I believe) so we can do the right thing with line
   terminators.

 * Data values should always be returned as strings, even if they are
   valid numbers. Let the application do data conversion.

Other stuff we haven't talked about much:

 * Unicode. I think we punt on this for now and just pretend that
   passing codecs.open(csvfile, mode, encoding) is sufficient.
I'm sure Martin von Löwis will let us know if it isn't. ;-) Dave said,
"The low level parser (C code) is probably going to need to handle
unicode." Let's wait and see how well codecs.open() works for us.

 * We know we need tests but haven't talked much about them. I vote for
   PyUnit as much as possible, though a certain amount of manual testing
   using existing spreadsheets and databases will be required.

 * Exceptions. We know we need some. We should start with CSVError and
   try to avoid getting carried away with things. If need be, we can add
   a code field to the class. I don't like the idea of having 17
   different subclasses of CSVError though. It's too much complexity for
   most users.

Skip

_______________________________________________
Csv mailing list
Csv at mail.mojam.com
http://manatee.mojam.com/mailman/listinfo/csv

From altis at semi-retired.com Thu Jan 30 04:01:19 2003
From: altis at semi-retired.com (Kevin Altis)
Date: Wed, 29 Jan 2003 19:01:19 -0800
Subject: [Csv] Re: First Cut at CSV PEP
In-Reply-To: <15928.35903.293437.273039@montanaro.dyndns.org>
Message-ID:

> From: Skip Montanaro
>
> Dave> The Python core already has a convenience function for doing the
> Dave> necessary conversion; PyObject_Str().
>
> This smacks of implicit type conversions to me, which has been the bane of
> my interaction with Perl (via XML-RPC). I still think we have no business
> writing anything but strings, Unicode strings (encoded by codecs.open()),
> ints and floats to CSV files. Exceptions should be raised for anything
> else, even None. An empty field is "".
>
> Dave> If we are in a hurry we could document the existing low level
> Dave> writer behaviour which is to invoke PyObject_Str() for all
> Dave> non-string values except None. None is translated to ''.
>
> I really still dislike this whole None thing. Whose use case is that
> anyway?

I think I brought up None. There was some initial confusion because Cliff's
DSV exporter was doing the wrong thing.
My feeling is that if you have a list [5, 'Bob', None, 1.1] as a csv with
the Excel dialect that becomes 5,Bob,,1.1 Are you saying that you want to
throw an exception instead? Booleans may also present a problem.

I was mostly thinking in terms of importing and exporting data from
embedded databases like MetaKit, my own list of dictionaries (flatfile
stuff), PySQLite, Gadfly. Anyway, the implication might be that it is
necessary for the user to sanitize data as part of the export operation
too. Have to ponder that. Regardless, we have to be careful to not make
this too complicated or it will be worse than nothing.

Quotes aren't going to get used in the case above unless you've specified
to always use them (overridden part of the Excel dialect), because no field
contains the comma separator character. Now that I look at this again the
Access export dialog I sent in an earlier email shows that the default
Access csv is actually a separate dialect because they specifically call
out the "Text qualifier" while numbers, empty fields (probably NULLS in
SQL?) will not have quotes, only text fields will.

To further complicate things I'm now wondering what happens with numbers in
Europe or elsewhere where the comma is used instead of a decimal point so
1.1 is 1,1 or does that not actually occur and I'm remembering some
localization issues incorrectly?

Reading in 5,Bob,,1.1 becomes ['5', 'Bob', '', '1.1'] because we said we
weren't going to do further processing, the user code should do further
conversions as part of the iteration.

I'm way behind on reading all the emails. I got bogged down in a bunch of
Mac OS X testing... I'll try and dig through them a little tomorrow and
Friday. If we put together the unittest test cases first then our input,
output, and expected results for processing would be clear for a given
dialect.
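[Editor's note: the behaviour Kevin describes can be sketched with a tiny hypothetical helper — not the module's API — showing both sides of the debate: silently writing None as an empty field, versus raising as Skip prefers:]

```python
# Hypothetical formatter (naive: no quoting), illustrating the two
# behaviours under discussion for None values.
def format_row(row, strict=False):
    fields = []
    for value in row:
        if value is None:
            if strict:
                # Skip's preference: None in a row is a bug in the caller.
                raise ValueError("None is not a valid CSV value")
            # Kevin's reading: None becomes an empty field.
            fields.append('')
        else:
            fields.append(str(value))
    return ','.join(fields)

print(format_row([5, 'Bob', None, 1.1]))   # 5,Bob,,1.1

# Reading back yields only strings; conversion is the application's job:
print(format_row([5, 'Bob', None, 1.1]).split(','))   # ['5', 'Bob', '', '1.1']
```

Note the round trip is lossy either way: an empty string and None both come back as `''`.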
ka

_______________________________________________
Csv mailing list
Csv at mail.mojam.com
http://manatee.mojam.com/mailman/listinfo/csv

From andrewm at object-craft.com.au Thu Jan 30 04:12:54 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 30 Jan 2003 14:12:54 +1100
Subject: [Csv] Status
In-Reply-To: Message from Skip Montanaro <15928.37531.445243.692589@montanaro.dyndns.org>
References: <15928.37531.445243.692589@montanaro.dyndns.org>
Message-ID: <20030130031254.D2E853C32B@coffee.object-craft.com.au>

>I'd like to get the wording in the PEP to converge on our current thoughts
>and announce it on c.l.py and python-dev sometime tomorrow. I think we will
>get a lot of feedback from both camps, hopefully some of it useful. ;-)
>
>Sound like a plan?

Yep, pending an ACK from the others.

>I just finished making a pass through the messages I hadn't deleted (and
>then saved them to a csv mbox file since the list appears to still not be
>archiving). Here's what I think we've concluded:

I have all the messages archived, which I can forward to you in a
convenient form for feeding to mailman.

> * Dialects are a set of defaults, probably implemented as classes (which
>   allows subclassing, whereas dicts wouldn't) and the default dialect
>   named as something like csv.dialects.excel or "excel" if we allow
>   string specifiers. (I think strings work well at the API, simply
>   because they are shorter and can more easily be presented in GUI
>   tools.)

I think you are right - we need strings as well, and a way to list them.
But exposing the "dialects are classes" to the user of the module is
valuable. I'd vote +1 on giving the class a "name" attribute, and the
dialects should probably share a common null root class (say "dialect") -
the "list_dialects()" function could then walk the csv.dialects namespace
returning the names of any classes found that are subclasses of dialect?
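[Editor's note: a rough sketch of the scheme Andrew describes — the class and attribute names here are assumptions, not a settled API:]

```python
# A null root class, dialect subclasses carrying a "name" attribute, and a
# list_dialects() that walks a namespace for Dialect subclasses.
class Dialect:
    pass

class excel(Dialect):
    name = 'excel'
    delimiter = ','
    quotechar = '"'

class excel_tsv(excel):
    name = 'excel-tsv'
    delimiter = '\t'

def list_dialects(namespace):
    # Collect the name of every Dialect subclass found in the namespace,
    # skipping the root class itself.
    return sorted(
        obj.name
        for obj in namespace.values()
        if isinstance(obj, type) and issubclass(obj, Dialect)
        and obj is not Dialect
    )

print(list_dialects(globals()))   # ['excel', 'excel-tsv']
```

In the real module the namespace walked would presumably be `vars(csv.dialects)` rather than `globals()`.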
> * These individual parameters are necessary (hopefully the names will be
>   enough clue as to their meaning): quote_char, quoting ("auto",
>   "always", "nonnumeric", "never"), delimiter, line_terminator,
>   skip_whitespace, escape_char, hard_return. Are there others?

Not that I can think of at the moment. As other dialects appear, we may
want to add new parameters anyway.

> * We're still undecided about None (I certainly don't think it's a valid
>   value to be writing to CSV files)

I suspect we're in violent agreement? If the user happens to pass None, it
should be written as a null field. On input, a null field should be
returned as a zero length string. Is that what you were suggesting?

> * Don't raise exceptions needlessly. For example, specifying
>   quoting="never" and not specifying a value for escape_char would be
>   okay until you encounter a field when writing which contains the
>   delimiter.

I don't like this specific one. Because it depends on the data, the module
user may not pick up their error during testing. Better to raise an
exception immediately if we know the format is invalid.

This is an argument I have over and over - I believe it's nearly always
better to push errors back towards their source. In spite of how it
sounds, this isn't really at odds with "be liberal in what you accept, be
strict in what you generate".

> * Files have to be opened in binary mode (we can check the mode
>   attribute I believe) so we can do the right thing with line
>   terminators.

We need to be a little careful when using uncommon interfaces on the file
class, because file-like classes may not have implemented them (for
example, StringIO doesn't have the mode attribute).

> * Data values should always be returned as strings, even if they are
>   valid numbers. Let the application do data conversion.

Yes. +1

>Other stuff we haven't talked about much:
>
> * Unicode. I think we punt on this for now and just pretend that
>   passing codecs.open(csvfile, mode, encoding) is sufficient.
> I'm sure Martin von Löwis will let us know if it isn't. ;-) Dave said,
> "The low level parser (C code) is probably going to need to handle
> unicode." Let's wait and see how well codecs.open() works for us.

I'm almost 100% certain the C code will need work. But it should be the
sort of work that can be done without disturbing the interface too much?

> * We know we need tests but haven't talked much about them. I vote for
>   PyUnit as much as possible, though a certain amount of manual testing
>   using existing spreadsheets and databases will be required.

This is the big one - tests are absolutely essential. I put a bit of effort
into coming up with a bunch of "this is how Excel does it with this unusual
case" tests for our csv module - we can use this as a start. I haven't
investigated how the official python test harness works - it predates
pyunit.

> * Exceptions. We know we need some. We should start with CSVError and
>   try to avoid getting carried away with things. If need be, we can add
>   a code field to the class. I don't like the idea of having 17
>   different subclasses of CSVError though. It's too much complexity for
>   most users.

Agreed.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

_______________________________________________
Csv mailing list
Csv at mail.mojam.com
http://manatee.mojam.com/mailman/listinfo/csv

From skip at pobox.com Thu Jan 30 04:35:49 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 29 Jan 2003 21:35:49 -0600
Subject: [Csv] Status
In-Reply-To: <20030130031254.D2E853C32B@coffee.object-craft.com.au>
References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au>
Message-ID: <15928.40341.991680.82247@montanaro.dyndns.org>

>> * We're still undecided about None (I certainly don't think it's a
>> valid value to be writing to CSV files)

Andrew> I suspect we're in violent agreement?
Andrew> If the user happens to pass None, it should be written as a null
Andrew> field. On input, a null field should be returned as a zero length
Andrew> string. Is that what you were suggesting?

Not really. In my mind, if I try to write

    [5.0, "marty", "golden slippers", None]

then I have a bug somewhere. I *don't* want None silently converted to ''.

>> * Don't raise exceptions needlessly. For example, specifying
>> quoting="never" and not specifying a value for escape_char would be
>> okay until you encounter a field when writing which contains the
>> delimiter.

Andrew> I don't like this specific one. Because it depends on the data,
Andrew> the module user may not pick up their error during testing.
Andrew> Better to raise an exception immediately if we know the format
Andrew> is invalid.

I can live with that. I would propose then that escape_char default to
something reasonable, not None.

Andrew> This is an argument I have over and over - I believe it's nearly
Andrew> always better to push errors back towards their source. In spite
Andrew> of how it sounds, this isn't really at odds with "be liberal in
Andrew> what you accept, be strict in what you generate".

If I cave on this, then you have to cave on None. ;-)

>> * Files have to be opened in binary mode (we can check the mode
>> attribute I believe) so we can do the right thing with line
>> terminators.

Andrew> We need to be a little careful when using uncommon interfaces on
Andrew> the file class, because file-like classes may not have
Andrew> implemented them (for example, StringIO doesn't have the mode
Andrew> attribute).

Correct. That occurred to me as well. Do we just punt if
hasattr(fileobj, "mode") returns False?

>> * Unicode. I think we punt on this for now and just pretend that
>> passing codecs.open(csvfile, mode, encoding) is sufficient. I'm sure
>> Martin von Löwis will let us know if it isn't. ;-) Dave said, "The low
>> level parser (C code) is probably going to need to handle unicode."
>> Let's wait and see how well codecs.open() works for us. Andrew> I'm almost 100% certain the C code will need work. But it should Andrew> the sort of work that can be done without disturbing the Andrew> interface too much? "Handle Unicode" probably doesn't mean messing with encoding/decoding issues though. Let the user deal with them. Andrew> I haven't investigated how the official python test harness Andrew> works - it predates pyunit. Most new tests are written using unittest (nee PyUnit) and many existing tests are getting converted. If we use the core test framework for as much as we can, our unit tests will just move cleanly from the sandbox to Lib/test/. Now to see about Mailman 2.1... Skip From skip at pobox.com Thu Jan 30 04:49:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 21:49:44 -0600 Subject: [Csv] Mailman upgrade Thursday on manatee.mojam.com Message-ID: <15928.41176.310569.780900@montanaro.dyndns.org> To all people subscribed to mailing lists hosted on manatee.mojam.com: I plan to upgrade the Mailman software on manatee.mojam.com (aka mail.mojam.com) sometime Thursday. I don't know the exact time because it will be a sort of as-I-have-time sort of thing. To perform the upgrade I will have to shut down mail service on the system for a time. I hope to keep that time to a minimum, but it will depend on what problems I encounter. During that time mail should queue up on remote hosts. Don't be alarmed if mail messages from your favorite mailing list stops arriving for awhile. I'll send out another message once the upgrade is complete or I've utterly failed and fallen back to the older version. 
--
Skip Montanaro
skip at pobox.com
http://www.musi-cal.com/

_______________________________________________
Csv mailing list
Csv at mail.mojam.com
http://manatee.mojam.com/mailman/listinfo/csv

From andrewm at object-craft.com.au Thu Jan 30 04:51:41 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 30 Jan 2003 14:51:41 +1100
Subject: [Csv] Status
In-Reply-To: Message from Skip Montanaro <15928.40341.991680.82247@montanaro.dyndns.org>
References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org>
Message-ID: <20030130035141.271EA3C32B@coffee.object-craft.com.au>

>Not really. In my mind, if I try to write
>
>    [5.0, "marty", "golden slippers", None]
>
>then I have a bug somewhere. I *don't* want None silently converted to ''.

I think you might be right.

[invalid combinations of options]
>I can live with that. I would propose then that escape_char default to
>something reasonable, not None.

That's a little hairy, because the resulting file can't be parsed correctly
by Excel. But it should be safe if the escape_char is only emitted if quote
is set to none.

>If I cave on this, then you have to cave on None. ;-)

*-)

[binary file mode, StringIO has no mode attribute]
>Correct. That occurred to me as well. Do we just punt if
>hasattr(fileobj, "mode") returns False?

Yes (or just catch the AttributeError and ignore it).

>"Handle Unicode" probably doesn't mean messing with encoding/decoding
>issues though. Let the user deal with them.

But the C code will care if it's passed a unicode string (which, I
understand, are not 8 bits per character - typically 16 bits). And the
escape_char, etc, will be 16 bits. I understand that some of the other C
modules are compiled twice and #define tricks are used to produce two
versions that perform optimally on their respective string type.

>Now to see about Mailman 2.1...

Did you try my suggestion?
I have a vague memory of there being an earlier version of Mailman that forgot to create that file. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 05:09:06 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 22:09:06 -0600 Subject: [Csv] Status In-Reply-To: <20030130035141.271EA3C32B@coffee.object-craft.com.au> References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org> <20030130035141.271EA3C32B@coffee.object-craft.com.au> Message-ID: <15928.42338.298849.316715@montanaro.dyndns.org> >> "Handle Unicode" probably doesn't mean messing with encoding/decoding >> issues though. Let the user deal with them. Andrew> But the C code will care if it's passed a unicode string (which, Andrew> I understand, are not 8 bits per character - typically 16 Andrew> bits). And the escape_char, etc, will be 16 bits. I understand Andrew> that some of the other C modules are compiled twice and #define Andrew> tricks are used to produce two versions that perform optimally Andrew> on their respective string type. In the C code can't you just look up "split", "join", "__add__" and such and not care that you are dealing with string or unicode objects? Even better, can't you just make heavy use of the abstract interface which implements many of the things that are trivial in Python code? >> Now to see about Mailman 2.1... Andrew> Did you try my suggestion? I have a vague memory of there being Andrew> an earlier version of Mailman that forgot to create that file. Yup. Now there's an empty csv.mbox file available on the web... 
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 06:53:55 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 23:53:55 -0600 Subject: [Csv] test message Message-ID: <15928.48627.343496.857274@montanaro.dyndns.org> Test of reconstituted csv list under Mailman 2.1 S From andrewm at object-craft.com.au Thu Jan 30 06:58:39 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 30 Jan 2003 16:58:39 +1100 Subject: [Csv] Status In-Reply-To: Message from Skip Montanaro <15928.42338.298849.316715@montanaro.dyndns.org> References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org> <20030130035141.271EA3C32B@coffee.object-craft.com.au> <15928.42338.298849.316715@montanaro.dyndns.org> Message-ID: <20030130055839.64E903C32B@coffee.object-craft.com.au> >In the C code can't you just look up "split", "join", "__add__" and such and >not care that you are dealing with string or unicode objects? Even better, >can't you just make heavy use of the abstract interface which implements >many of the things that are trivial in Python code? Currently the C module just deals with raw strings. I suspect there would be a fair performance cost to using the string object's methods (I should have a look at how strings and unicode strings are implemented internally these days). Suffice to say, it's a reasonable amount of work. We probably should be focusing on refining the PEP and writing some tests at this stage... 8-) Regarding the PEP - - are we going to retain the ability to pass keyword arguments, that override the dialect, to the factory functions (the pep doesn't mention this)? - we could make the dialect parameter accept either a string dialect name or a dialect instance - is this a good idea? 
- regarding the dialect list function - this probably should be called
  list_dialects(), yes?

- should we call the delimiter parameter "field_sep" instead (I notice you
  haven't used underscores in the parameter names - is this deliberate)?

Thinking about the tests, I envisage a bunch of tests for the underlying C
module, and tests for each dialect (just the basic dialect with no
additional parameters)?

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Thu Jan 30 06:59:10 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 30 Jan 2003 16:59:10 +1100
Subject: [Csv] Module question...
Message-ID: <20030130055910.828A43C32B@coffee.object-craft.com.au>

The way we've specced it, the module only deals with file objects. I wonder
if there's any need to deal with strings, rather than files? What was the
rationale for using files, rather than making the user do their own
readline(), etc?

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com Thu Jan 30 06:59:49 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 29 Jan 2003 23:59:49 -0600
Subject: [Csv] Looks like we're live...
Message-ID: <15928.48981.251010.410861@montanaro.dyndns.org>

It looks like I successfully migrated this mailing list to Mailman 2.1. We
have archives and everything.

Andrew, you said you had an archive of all the messages. Can you pass that
along to me with any tips you feel worthwhile about incorporating that
archive into pipermail?
Thx, Skip From skip at pobox.com Thu Jan 30 07:08:02 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 00:08:02 -0600 Subject: [Csv] Status In-Reply-To: <20030130055839.64E903C32B@coffee.object-craft.com.au> References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org> <20030130035141.271EA3C32B@coffee.object-craft.com.au> <15928.42338.298849.316715@montanaro.dyndns.org> <20030130055839.64E903C32B@coffee.object-craft.com.au> Message-ID: <15928.49474.186478.320826@montanaro.dyndns.org> Andrew> We probably should be focusing on refining the PEP and writing Andrew> some tests at this stage... 8-) That sounds like a good idea. Andrew> Regarding the PEP - Andrew> - are we going to retain the ability to pass keyword arguments, Andrew> that override the dialect, to the factory functions (the pep Andrew> doesn't mention this)? Yes, I thought that was the plan. The current text under Module Interface gives an incomplete function prototype: reader(fileobj [, dialect='excel2000']) but in the text below it says: The optional dialect parameter is discussed below. It also accepts several keyword parameters which define specific format settings (see the section "Formatting Parameters"). I'd like not to enumerate all the possible keyword parameters, especially since that list may grow. How should I write the synopsis? reader(fileobj [, dialect='excel2000'] [, keyword parameters]) ? Andrew> - we could make the dialect parameter accept either a string Andrew> dialect name or a dialect instance - is this a good idea? It can pretty easily do both. Perhaps we should present the pros and cons in the PEP and see what kind of feedback we get. Andrew> - regarding the dialect list function - this probably should be Andrew> called list_dialects(), yes? Where do you see dialect_list()? Maybe I need to cvs up. In any case, I like list_dialects() better. 
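[Editor's note: accepting either a dialect name or a dialect class, as discussed above, might look like this — a hypothetical sketch; the registry and names are assumptions, not the module's API:]

```python
# Hypothetical registry mapping string names to dialect classes.
class Dialect:
    pass

class excel(Dialect):
    name = 'excel'
    delimiter = ','

_dialects = {'excel': excel}

def resolve_dialect(dialect):
    # Accept either a registered name or a Dialect subclass directly.
    if isinstance(dialect, str):
        try:
            return _dialects[dialect]
        except KeyError:
            raise ValueError('unknown dialect: %r' % dialect)
    if isinstance(dialect, type) and issubclass(dialect, Dialect):
        return dialect
    raise TypeError('dialect must be a name or a Dialect subclass')

print(resolve_dialect('excel') is excel)   # True
print(resolve_dialect(excel) is excel)     # True
```

The pro of strings is brevity and easy presentation in GUI tools; the pro of classes is that users can subclass and pass their own without registering anything.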
Andrew> - should we call the delimiter parameter "field_sep" instead (I Andrew> notice you haven't used underscores in the parameter names - Andrew> is this deliberate)? I don't have a big preference one way or the other. I've been calling it "delimiter" though. Andrew> Thinking about the tests, I envisage a bunch of tests for the Andrew> underlying C module, and tests for each dialect (just the basic Andrew> dialect with no additional parameters)? Give me one test you'd like to run and one set of inputs and expected outputs. I'll set up a module tomorrow which should just drop into Lib/test. I'm kind of running out of steam. (It's Thursday 12:07am here.) Skip From altis at semi-retired.com Thu Jan 30 07:33:49 2003 From: altis at semi-retired.com (Kevin Altis) Date: Wed, 29 Jan 2003 22:33:49 -0800 Subject: [Csv] Access Products sample Message-ID: I created a db and table in Access (products.mdb) using one of the built-in samples. I created two rows, one that is mostly empty. I used the default CSV export to create(Products.csv) and also output the table as an Excel 97/2000 XLS file (Products.xls). Finally, I had Excel export as CSV (ProductsExcel.csv). They are all contained in the attached zip. The currency column in the table is actually written out with formatting ($5.66 instead of just 5.66). Note that when Excel exports this column it has a trailing space for some reason (,$5.66 ,). While exporting it reminded me that unless a column in the data set contains an embedded newline or carriage return it shouldn't matter whether the file is opened in binary mode for reading. Without a schema we don't know what each column is supposed to contain, so that is outside the domain of the csv import parser and export writer. 
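[Editor's note: the embedded-newline point above is the crux of the line-terminator issue — a quoted field may span what looks like two lines, so splitting the raw data on '\n' mis-parses the record. A small illustration with a hypothetical helper, not the parser itself:]

```python
# One logical record whose quoted second field contains a newline.
data = 'Test 1,"two\nlines",last\n'

# Naive line splitting sees two broken "rows" (plus a trailing empty piece).
print(data.split('\n'))

# A record-aware scan has to track whether it is inside quotes.
def count_records(text, quotechar='"'):
    records = 0
    in_quotes = False
    for ch in text:
        if ch == quotechar:
            in_quotes = not in_quotes
        elif ch == '\n' and not in_quotes:
            records += 1
    return records

print(count_records(data))   # 1 record, despite two raw lines
```

This is why the parser wants the file object itself (opened in binary mode) rather than being fed pre-split lines by the caller.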
The values exported by both Access and Excel are designed to prevent
information loss within the constraints of the CSV format, thus a field
with no value (what I think of as None in Python) is empty in the CSV.
Should we be able to import and then export using a given dialect, such
that there would be no differences between the original csv and the
exported one? Actually, using the Access default of quoting strings it
isn't possible to do that because it implies having a schema to know that a
given column is a string. With the Excel csv format it is possible because
a column that doesn't contain a comma won't be quoted. Just thinking out
loud.

ka

-------------- next part --------------
A non-text attachment was scrubbed...
Name: products.zip
Type: application/x-zip-compressed
Size: 17035 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20030129/e29424d8/attachment.bin

From altis at semi-retired.com Thu Jan 30 07:55:03 2003
From: altis at semi-retired.com (Kevin Altis)
Date: Wed, 29 Jan 2003 22:55:03 -0800
Subject: [Csv] Module question...
In-Reply-To: <20030130055910.828A43C32B@coffee.object-craft.com.au>
Message-ID:

> From: Andrew McNamara
>
> The way we've specced it, the module only deals with file objects. I
> wonder if there's any need to deal with strings, rather than files?

A string can be wrapped as StringIO to appear as a file and there may also
be other file-like objects that people might want to pass in.

> What was the rationale for using files, rather than making the user do
> their own readline(), etc?

I'll try and summarize, if this is too simplistic or incorrect I'm sure
someone will speak up :)

The simplest solution might have been to provide a file path and then let
the parser handle all the opening, reading, and closing, returning a result
list.
However, that is far too limiting: if you want to parse a string or
something that isn't a physical file on disk, you have to collect the raw
data, write it to a temp file and then pass the path of the temp file in.
Definitely too cumbersome.

It would be possible to require the user code to supply one large string
to parse, thus putting the burden of opening, reading, and closing the
file-like object on the user code. This wastes memory, which can be a
problem especially for large data files.

One other possibility would be for the parser to only deal with one row at
a time, leaving it up to the user code to feed the parser the row strings.
But given the various possible line endings for a row of data and the fact
that a column of a row may contain a line ending, not to mention all the
other escape character issues we've discussed, this would be error-prone.

The solution was to simply accept a file-like object and let the parser do
the interpretation of a record. By having the parser present an iterable
interface, the user code still gets the convenience of processing per row
if needed, or, if no processing is desired, a result list can easily be
obtained. This should provide the most flexibility while still being easy
to use.

ka

From altis at semi-retired.com Thu Jan 30 07:58:22 2003
From: altis at semi-retired.com (Kevin Altis)
Date: Wed, 29 Jan 2003 22:58:22 -0800
Subject: [Csv] change of Sender address
Message-ID: 

Skip,

the mailing list Sender: is now csv-bounces at mail.mojam.com while
previously it was csv-admin at mail.mojam.com. Is that intentional?
ka From andrewm at object-craft.com.au Thu Jan 30 08:24:57 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 30 Jan 2003 18:24:57 +1100 Subject: [Csv] Status In-Reply-To: Message from Skip Montanaro <15928.49474.186478.320826@montanaro.dyndns.org> References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org> <20030130035141.271EA3C32B@coffee.object-craft.com.au> <15928.42338.298849.316715@montanaro.dyndns.org> <20030130055839.64E903C32B@coffee.object-craft.com.au> <15928.49474.186478.320826@montanaro.dyndns.org> Message-ID: <20030130072457.25FEB3C32B@coffee.object-craft.com.au> > Andrew> - are we going to retain the ability to pass keyword arguments, > Andrew> that override the dialect, to the factory functions (the pep > Andrew> doesn't mention this)? > >Yes, I thought that was the plan. Just checking... 8-) >I'd like not to enumerate all the possible keyword parameters, especially >since that list may grow. How should I write the synopsis? > > reader(fileobj [, dialect='excel2000'] [, keyword parameters]) > >? Maybe make it "optional keyword parameters"... implied, I know, but... > Andrew> - we could make the dialect parameter accept either a string > Andrew> dialect name or a dialect instance - is this a good idea? > >It can pretty easily do both. Perhaps we should present the pros and cons >in the PEP and see what kind of feedback we get. Sometimes you can give people too much choice. We don't have time for an endless discussion. If we don't think we're going to be crucified, we should just pick something that's tasteful. Dave? > Andrew> - regarding the dialect list function - this probably should be > Andrew> called list_dialects(), yes? > >Where do you see dialect_list()? Maybe I need to cvs up. In any case, I >like list_dialects() better. 
Ah - I mean "dialect list function" in the generic sense - we need one,
and I was proposing to call it list_dialects, or maybe that should be
listdialects to be like listdir... nah, looks ugly.

> Andrew> - should we call the delimiter parameter "field_sep" instead (I
> Andrew> notice you haven't used underscores in the parameter names -
> Andrew> is this deliberate)?
>
> >I don't have a big preference one way or the other. I've been calling
> >it "delimiter" though.

Is there any precedent in the other modules? Our module called it
field_sep, and I noticed you called it that in the description.

> Andrew> Thinking about the tests, I envisage a bunch of tests for the
> Andrew> underlying C module, and tests for each dialect (just the basic
> Andrew> dialect with no additional parameters)?
>
> >Give me one test you'd like to run and one set of inputs and expected
> >outputs. I'll set up a module tomorrow which should just drop into
> >Lib/test. I'm kind of running out of steam. (It's Thursday 12:07am
> >here.)

I might be able to work it out myself... we'll see.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Thu Jan 30 08:33:52 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 30 Jan 2003 18:33:52 +1100
Subject: [Csv] Module question...
In-Reply-To: Message from "Kevin Altis" 
References: 
Message-ID: <20030130073352.A55953C32B@coffee.object-craft.com.au>

>> The way we've speced it, the module only deals with file objects. I
>> wonder if there's any need to deal with strings, rather than files?

BTW, I'm asking this because it's something that will come back to haunt
us if we get it wrong - it's something we need to make the right call on.

>A string can be wrapped as StringIO to appear as a file and there may also
>be other file-like objects that people might want to pass in.
Yes - if the most common use by far is reading and writing files, then
this is the right answer (i.e., say "use StringIO if you really need to do
a string").

>> What was the rationale for using files, rather than making the user do
>> their own readline(), etc?
>
>I'll try and summarize, if this is too simplistic or incorrect I'm sure
>someone will speak up :)
>
>The simplest solution might have been to provide a file path and then let
>the parser handle all the opening, reading, and closing, returning a
>result list. However, that is far too limiting since then if you do want
>to parse a string or something that isn't a physical file on disk you
>have to collect the raw data, write it to a temp file and then pass the
>path of the temp file in. Definitely too cumbersome.

Yeah - I'm certainly not suggesting that.

>It would be possible to require the user code to supply one large string
>to parse, thus putting the burden of opening, reading, and closing the
>file-like object. This wastes memory, which can be a problem especially
>for large data files.

Agreed.

>One other possibility would be for the parser to only deal with one row
>at a time, leaving it up to the user code to feed the parser the row
>strings. But given the various possible line endings for a row of data
>and the fact that a column of a row may contain a line ending, not to
>mention all the other escape character issues we've discussed, this would
>be error-prone.

This is the way the Object Craft module has worked - it works well enough,
and the universal end-of-line stuff in 2.3 makes it more seamless. Not
saying I'm wedded to this scheme, but I'd just like to be clear about why
we've chosen one over the other.

I'm trying to think of an example where operating on a file-like object
would be too restricting, and I can't - oh, here's one: what if you wanted
to do some pre-processing on the data (say it was uuencoded)?
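The pre-processing case is easy enough if the reader accepts any iterable
of lines, as Dave suggests later in the thread. A sketch (here a generator
strips comment lines rather than uudecoding, purely for illustration):

```python
import csv

raw = ['# a comment the CSV parser should never see\n',
       'a,b,"c,d"\n',
       '1,2,3\n']

def strip_comments(lines):
    # Pre-processing step: drop comment lines before handing the
    # stream to the reader. Any generator of line strings works,
    # since the reader only needs an iterable of lines.
    for line in lines:
        if not line.startswith('#'):
            yield line

rows = list(csv.reader(strip_comments(raw)))
```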
>The solution was to simply accept a file-like object and let the parser
>do the interpretation of a record. By having the parser present an
>iterable interface, the user code still gets the convenience of
>processing per row if needed or if no processing is desired a result list
>can easily be obtained.
>
>This should provide the most flexibility while still being easy to use.

Should the object just be defined as an iterable, and leave closing, etc,
up to the user of the module? One downside of this is you can't rewind an
iterator, so things like the sniffer would be SOL. We can't ensure that
the passed file is rewindable either. Hmmm.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Thu Jan 30 08:36:37 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 30 Jan 2003 18:36:37 +1100
Subject: [Csv] Access Products sample
In-Reply-To: Message from "Kevin Altis" 
References: 
Message-ID: <20030130073637.CA4DD3C32B@coffee.object-craft.com.au>

>The currency column in the table is actually written out with formatting
>($5.66 instead of just 5.66). Note that when Excel exports this column it
>has a trailing space for some reason (,$5.66 ,).

I think you'll find that if you enter a negative amount, that space turns
into a minus sign (not verified).

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Thu Jan 30 08:37:53 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 30 Jan 2003 18:37:53 +1100
Subject: [Csv] change of Sender address
In-Reply-To: Message from "Kevin Altis" 
References: 
Message-ID: <20030130073754.02FEF3C32B@coffee.object-craft.com.au>

>the mailing list Sender: is now csv-bounces at mail.mojam.com while
>previously it was csv-admin at mail.mojam.com. Is that intentional?

Mailman attempts to handle the bounces itself - I guess that's just
something that has changed between 2.0 and 2.1.
--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From altis at semi-retired.com Thu Jan 30 09:54:16 2003
From: altis at semi-retired.com (Kevin Altis)
Date: Thu, 30 Jan 2003 00:54:16 -0800
Subject: [Csv] Module question...
In-Reply-To: <20030130073352.A55953C32B@coffee.object-craft.com.au>
Message-ID: 

> From: Andrew McNamara
>
> >> The way we've speced it, the module only deals with file objects. I
> >> wonder if there's any need to deal with strings, rather than files?
>
> BTW, I'm asking this because it's something that will come back to haunt
> us if we get it wrong - it's something we need to make the right call on.

Agreed, in fact I'm now reconsidering my position.

> >One other possibility would be for the parser to only deal with one row
> >at a time, leaving it up to the user code to feed the parser the row
> >strings. But given the various possible line endings for a row of data
> >and the fact that a column of a row may contain a line ending, not to
> >mention all the other escape character issues we've discussed, this
> >would be error-prone.
>
> This is the way the Object Craft module has worked - it works well
> enough, and the universal end-of-line stuff in 2.3 makes it more
> seamless. Not saying I'm wedded to this scheme, but I'd just like to be
> clear about why we've chosen one over the other.

I'm tempted to agree that maybe your original way would be better, but I
haven't caught up on some of the discussion the last couple of days. Skip
and Cliff can probably argue effectively for not doing it that way if they
really want.

> I'm trying to think of an example where operating on a file-like object
> would be too restricting, and I can't - oh, here's one: what if you
> wanted to do some pre-processing on the data (say it was uuencoded)?

That seems to be stretching things a bit, but even then wouldn't you
simply pass the uuencoded file-like object to uu.decode and then pass the
out_file file-like object to the parser?
I haven't used uu myself, so maybe that wouldn't work. Regardless, the csv
module should be focused on one task.

> >The solution was to simply accept a file-like object and let the parser
> >do the interpretation of a record. By having the parser present an
> >iterable interface, the user code still gets the convenience of
> >processing per row if needed or if no processing is desired a result
> >list can easily be obtained.
> >
> >This should provide the most flexibility while still being easy to use.
>
> Should the object just be defined as an iterable, and leave closing,
> etc, up to the user of the module? One downside of this is you can't
> rewind an iterator, so things like the sniffer would be SOL. We can't
> ensure that the passed file is rewindable either. Hmmm.

Given a file-like object, you might not be able to rewind anyway. This
might be another argument for just parsing line by line, but does that
make using the module too complex and error-prone? We probably have to
provide some use-case examples. Putting the whole operation in a
try/except/finally block with the file close in finally is probably the
safe way to do this type of operation.

In the PEP we need to make clear the benefits of the csv module over a
user simply trying to use split(',') and such, which I think Skip has
already done to a certain extent. We are also trying to address export as
well, which is actually quite important. If people simply try to export
with only a simplistic understanding of the edge cases, then they
potentially end up with unusable csv files. This is the same kind of thing
you see with XML where people start writing out data or whatever thinking
that is all there is to it and then they end up with something that isn't
really XML. I wouldn't be surprised if there is more invalid XML out there
than valid. In our case I think we are identifying some pretty clearly
defined dialects of csv, so that if you use those you are going to be in
good shape.
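The try/finally pattern mentioned above might look like this (a sketch
against the modern csv module; the load_rows helper is hypothetical):

```python
import csv
import io

def load_rows(fileobj):
    # Close the file-like object even if parsing raises part-way
    # through - the try/finally pattern suggested above.
    try:
        return list(csv.reader(fileobj))
    finally:
        fileobj.close()

rows = load_rows(io.StringIO('x,y\n1,2\n'))
```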
We will also be able to tell someone whether in fact a file is well-formed and/or throw an exception if it doesn't match the chosen dialect, which again, seems simple, but that's a pretty big deal. Ugh, I need sleep, any stupidity above is just me being tired ;-) ka From djc at object-craft.com.au Thu Jan 30 11:06:23 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 21:06:23 +1100 Subject: [Csv] CSV interface question In-Reply-To: <15928.4659.449989.410123@montanaro.dyndns.org> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Cliff> Actually, there is a downside to using strings, as you will see Cliff> if you look at the code I posted a little while ago. By taking Cliff> dialect as a string, it basically precludes the user rolling Cliff> their own dialect except as keyword arguments. After working Cliff> on this, I'm inclined to have the programmer pass a class or Cliff> other structure. Skip> Don't forget we have the speedy Object Craft _csv engine sitting Skip> underneath the covers. Under the assumption that all the actual Skip> processing goes on at that level, I see no particular reason Skip> dialect info needs to be anything other than a collection of Skip> keyword arguments. I view csv.reader and csv.writer as factory Skip> functions which return functional readers and writers defined in Skip> _csv.c. The Python level serves simply to paper over the Skip> low-level extension module. I have been going through the messages again to see if I can build up a TODO list. I missed something on the first reading of this message. 
In the current version of the code sitting in the sandbox the reader
factory is actually a class:

    class reader(OCcvs):
        def __init__(self, fileobj, dialect = 'excel2000', **options):
            self.fileobj = fileobj
            OCcvs.__init__(self, dialect, **options)

        def __iter__(self):
            return self

        def next(self):
            while 1:
                fields = self.parser.parse(self.fileobj.next())
                if fields:
                    return fields

Your message above talks about the _csv parser exposing the iterator
interface, not the Python layer. I wonder how much of a measurable
performance difference there would be by leaving the code as is.

- Dave

--
http://www.object-craft.com.au

From djc at object-craft.com.au Thu Jan 30 11:56:36 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 30 Jan 2003 21:56:36 +1100
Subject: [Csv] Status
In-Reply-To: <20030130072457.25FEB3C32B@coffee.object-craft.com.au>
References: <15928.37531.445243.692589@montanaro.dyndns.org>
 <20030130031254.D2E853C32B@coffee.object-craft.com.au>
 <15928.40341.991680.82247@montanaro.dyndns.org>
 <20030130035141.271EA3C32B@coffee.object-craft.com.au>
 <15928.42338.298849.316715@montanaro.dyndns.org>
 <20030130055839.64E903C32B@coffee.object-craft.com.au>
 <15928.49474.186478.320826@montanaro.dyndns.org>
 <20030130072457.25FEB3C32B@coffee.object-craft.com.au>
Message-ID: 

>>>>> "Andrew" == Andrew McNamara writes:

>> I'd like not to enumerate all the possible keyword parameters,
>> especially since that list may grow. How should I write the
>> synopsis?
>>
>> reader(fileobj [, dialect='excel2000'] [, keyword parameters])
>>
>> ?

Andrew> Maybe make it "optional keyword parameters"... implied, I
Andrew> know, but...

[I have been frantically trying to reread all of the messages again.
Other work has made me fall behind and lose context.]

Is there any harm in just doing something like this:

    The basic reading interface is::

        reader(fileobj [, **kwargs])

    The dialect keyword argument identifies the CSV dialect which will
    be implemented by the reader.
The dialect corresponds to a set of parameters which are set in the low
level CSV parsing engine. Variants of a dialect can be specified by
passing additional keyword arguments which serve to override the
parameters defined by the dialect argument. The parser parameters are
catalogued below.

Andrew> - we could make the dialect parameter accept either a string
Andrew> dialect name or a dialect instance - is this a good idea?

+1 from me.

csv.py::

    class dialect:
        name = None
        quotechar = "'"
        delimiter = ","

    excel2000 = dialect

yourcode.py::

    import csv

    my_dialect = csv.dialect()
    my_dialect.delimiter = '\t'

    # or

    class my_dialect(csv.dialect):
        delimiter = '\t'

    csvreader = csv.reader(file("some.csv"), dialect=my_dialect)

>> It can pretty easily do both. Perhaps we should present the pros
>> and cons in the PEP and see what kind of feedback we get.

Andrew> Sometimes you can give people too much choice. We don't have
Andrew> time for an endless discussion. If we don't think we're going
Andrew> to be crucified, we should just pick something that's
Andrew> tasteful. Dave?

If we had to choose one, I would say pass a class or instance rather than
a string.

Andrew> - regarding the dialect list function - this probably should
Andrew> be called list_dialects(), yes?

>> Where do you see dialect_list()? Maybe I need to cvs up. In any
>> case, I like list_dialects() better.

Andrew> Ah - I mean "dialect list function" in the generic sense - we
Andrew> need one, and I was proposing to call it list_dialects, or
Andrew> maybe that should be listdialects to be like listdir... nah,
Andrew> looks ugly.

+1 list_dialects()

Andrew> - should we call the delimiter parameter "field_sep" instead
Andrew> (I notice you haven't used underscores in the parameter names
Andrew> - is this deliberate)?

>> I don't have a big preference one way or the other. I've been
>> calling it "delimiter" though.
+1 delimiter (I think :-)

- Dave

--
http://www.object-craft.com.au

From djc at object-craft.com.au Thu Jan 30 12:31:06 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 30 Jan 2003 22:31:06 +1100
Subject: [Csv] Module question...
In-Reply-To: 
References: 
Message-ID: 

>>>>> "Kevin" == Kevin Altis writes:

>> From: Andrew McNamara
>>
>> >> The way we've speced it, the module only deals with file objects.
>> >> I wonder if there's any need to deal with strings, rather than
>> >> files?
>>
>> BTW, I'm asking this because it's something that will come back to
>> haunt us if we get it wrong - it's something we need to make the
>> right call on.

Kevin> Agreed, in fact I'm now reconsidering my position.

When I originally wrote the Object Craft parser I thought about these
things too. I eventually settled on the current interface. To use the
stuff in CVS now, this is what the interface looks like:

    csvreader = _csv.parser()
    for line in file("some.csv"):
        row = csvreader.parse(line)
        if row:
            process(row)

The reason I settled on this interface was that it placed only the
performance critical code into the extension module. All policy decisions
about where the CSV data would come from were pushed back into the
application.

The current PEP is only a slight variation on this, but it is a nice
variation. The variation pushes the conditional in the loop into the
reader and thereby exposes a much nicer interface.

Hmmm... The argument to the PEP reader() should not be a file object, it
should be an iterator which returns lines. There really is no reason why
it should not handle the following:

    lines = ('1,2,3,"""I see,""\n',
             'said the blind man","as he picked up his\n',
             'hammer and saw"\n')
    csvreader = csv.reader(lines)
    for row in csvreader:
        process(row)

>> >One other possibility would be for the parser to only deal with
>> >one row at a time, leaving it up to the user code to feed the
>> >parser the row strings.
>> >But given the various possible line endings for a row of data and
>> >the fact that a column of a row may contain a line ending, not to
>> >mention all the other escape character issues we've discussed, this
>> >would be error-prone.
>>
>> This is the way the Object Craft module has worked - it works well
>> enough, and the universal end-of-line stuff in 2.3 makes it more
>> seamless. Not saying I'm wedded to this scheme, but I'd just like to
>> be clear about why we've chosen one over the other.

You might have missed it but the Object Craft parser is designed to be
fed one line at a time. It actually raises an exception if you pass more
than one line to it. Internally it collects fields from lines until it
detects end of record, at which point it returns the record to the caller.

>> I'm trying to think of an example where operating on a file-like
>> object would be too restricting, and I can't - oh, here's one: what
>> if you wanted to do some pre-processing on the data (say it was
>> uuencoded)?

I think this could be solved by changing the reader() fileobj argument to
an iterable.

>> >The solution was to simply accept a file-like object and let the
>> >parser do the interpretation of a record. By having the parser
>> >present an iterable interface, the user code still gets the
>> >convenience of processing per row if needed or if no processing is
>> >desired a result list can easily be obtained.

Is this the same thing as what I said above?

>> Should the object just be defined as an iterable, and leave
>> closing, etc, up to the user of the module? One downside of this is
>> you can't rewind an iterator, so things like the sniffer would be
>> SOL. We can't ensure that the passed file is rewindable
>> either. Hmmm.
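One way around the rewind problem, when the underlying object really is a
seekable file, is to sniff a bounded sample and then seek back. A sketch
using the modern csv.Sniffer as a stand-in for the csvutils sniffer being
discussed:

```python
import csv
import io

fileobj = io.StringIO('a;b;c\n1;2;3\n4;5;6\n')

# Sniff a bounded sample, then rewind and parse the whole stream.
# This only works because StringIO (like a real file) is seekable;
# a bare iterator could not be rewound.
sample = fileobj.read(1024)
dialect = csv.Sniffer().sniff(sample)
fileobj.seek(0)
rows = list(csv.reader(fileobj, dialect))
```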
Application code will just have to be aware of this and arrange to do
something like the following:

    sniffer_input = [fileobj.readline() for i in range(20)]
    dialect = csvutils.sniff(sniffer_input)
    csvreader = csv.reader(sniffer_input, dialect=dialect)
    for row in csvreader:
        process(row)

Then we have two problems (our principal weapons are surprise and fear):

* The sniffer_input might have a partial record (multi-line record
  spanning last line read out of file).

* We do not have a way to continue using a reader with additional input.

* The list comprehension may be longer than the file :-)

This could be solved by exposing a further method on the reader.

    sniffer_input = [fileobj.readline() for i in range(20)]
    dialect = csvutils.sniff(sniffer_input)
    csvreader = csv.reader(sniffer_input, dialect=dialect)
    for row in csvreader:
        process(row)
    # now continue on with the rest of the file
    csvreader.use(fileobj)
    for row in csvreader:
        process(row)

Given the above, is it reasonable to say that the above logic could be
hardened and placed into a csvutils function?

- Dave

--
http://www.object-craft.com.au

From djc at object-craft.com.au Thu Jan 30 13:13:10 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 30 Jan 2003 23:13:10 +1100
Subject: [Csv] Made some changes to the PEP
Message-ID: 

Here is the commit message:

    Trying to bring PEP up to date with discussions on mailing list. I
    hope I have not misinterpreted the conclusions.

    * dialect argument is now either a string identifying one of the
      internally defined parameter sets, or an object which contains
      attributes corresponding to the parameter set.

    * Altered set_dialect() to take dialect name and dialect object.

    * Altered get_dialect() to take dialect name and return dialect
      object.

    * Fleshed out formatting parameters, adding escapechar,
      lineterminator, quoting.
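A minimal sketch of the set_dialect()/get_dialect()/list_dialects()
registry named in the commit message (the function names come from the
discussion; the internals here are assumptions, not the PEP's actual
code):

```python
# Hypothetical registry internals - a dict keyed by dialect name,
# holding classes whose attributes are the formatting parameters.
_dialects = {}

class Dialect:
    delimiter = ','
    quotechar = '"'

class excel(Dialect):
    pass

class excel_tab(Dialect):
    delimiter = '\t'

def set_dialect(name, dialect):
    _dialects[name] = dialect

def get_dialect(name):
    return _dialects[name]

def list_dialects():
    return sorted(_dialects)

set_dialect('excel', excel)
set_dialect('excel-tab', excel_tab)
```

This also shows why accepting either a name or a class is cheap: the
string form is just one dictionary lookup away from the class form.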
- Dave

--
http://www.object-craft.com.au

From skip at pobox.com Thu Jan 30 13:21:18 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 30 Jan 2003 06:21:18 -0600
Subject: [Csv] We have archives
Message-ID: <15929.6334.716837.600555@montanaro.dyndns.org>

Thanks to Andrew saving messages, we have archives. There are probably a
few duplicates around the transition to MM 2.1, but I decided to not worry
about it.

Skip

From skip at pobox.com Thu Jan 30 13:28:39 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 30 Jan 2003 06:28:39 -0600
Subject: [Csv] change of Sender address
In-Reply-To: 
References: 
Message-ID: <15929.6775.921999.863765@montanaro.dyndns.org>

Kevin> the mailing list Sender: is now csv-bounces at mail.mojam.com while
Kevin> previously it was csv-admin at mail.mojam.com. Is that intentional?

It appears to be a side effect of the transition from Mailman 2.0.9 to
Mailman 2.1. I suspect it was intentional on Barry Warsaw's part. ;-)

The old version of the list had these aliases:

    csv
    csv-admin
    csv-request
    csv-owner

while the new version has many more:

    csv
    csv-admin
    csv-bounces
    csv-confirm
    csv-join
    csv-leave
    csv-owner
    csv-request
    csv-subscribe
    csv-unsubscribe

It seems the system now has more fine-grained control over the disposition
of admin messages.

Skip

From skip at pobox.com Thu Jan 30 13:37:46 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 30 Jan 2003 06:37:46 -0600
Subject: [Csv] Module question...
In-Reply-To: <20030130073352.A55953C32B@coffee.object-craft.com.au>
References: <20030130073352.A55953C32B@coffee.object-craft.com.au>
Message-ID: <15929.7322.744669.187499@montanaro.dyndns.org>

>> One other possibility would be for the parser to only deal with one
>> row at a time, leaving it up to the user code to feed the parser the
>> row strings.
>> But given the various possible line endings for a row of data and the
>> fact that a column of a row may contain a line ending, not to mention
>> all the other escape character issues we've discussed, this would be
>> error-prone.

Andrew> This is the way the Object Craft module has worked - it works
Andrew> well enough, and the universal end-of-line stuff in 2.3 makes it
Andrew> more seamless. Not saying I'm wedded to this scheme, but I'd
Andrew> just like to be clear about why we've chosen one over the other.

You have to be careful. I think the Universal eol stuff might bite you in
the arse here. Recall that in Excel, the default line terminator (record
separator?) is CRLF, but that a hard return within a cell is simply LF. I
don't know what Universal eol handling will do with that. In any case,
because you have to have full control over line termination, I think you
have to start dealing just with binary files.

Andrew> I'm trying to think of an example where operating on a file-like
Andrew> object would be too restricting, and I can't - oh, here's one:
Andrew> what if you wanted to do some pre-processing on the data (say it
Andrew> was uuencoded)?

Then you force the user to uudecode the file and stuff it into a StringIO
object. ;-)

Andrew> Should the object just be defined as an iterable,

I had envisioned that the object the csv.reader() factory function (or
class) returned would be an iterable and that the object the csv.writer()
factory function (or class) returned would accept an iterable.

Andrew> closing, etc, up to the user of the module? One downside of this
Andrew> is you can't rewind an iterator, so things like the sniffer
Andrew> would be SOL. We can't ensure that the passed file is rewindable
Andrew> either. Hmmm.

The sniffer is going to be in a csvutils module, correct? It could
certainly accept either a filename or a string containing some subset of
the rows in the file to be sniffed. I see no reason to constrain it to
the csv.reader()'s interface.
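The hard-return-within-a-cell case is exactly why the reader, not the
caller, has to decide where a record ends. A sketch with the modern csv
module, where one logical record spans two physical lines:

```python
import csv
import io

# One logical record spanning two physical lines: the quoted field
# contains a hard return (LF) while the record itself ends in CRLF.
data = 'Test 1,"line one\nline two",end\r\n'
rows = list(csv.reader(io.StringIO(data)))
```

A line-at-a-time caller splitting on newlines would hand the parser half
a record here; the reader interface hides that bookkeeping.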
Skip

From djc at object-craft.com.au Thu Jan 30 13:40:33 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 30 Jan 2003 23:40:33 +1100
Subject: [Csv] Moving _csv.c closer to PEP
Message-ID: 

In the process of fixing _csv.c so it will handle the parameters specified
in the PEP I came across yet another configurable dialect setting.

    doublequote
        When True, a quotechar in a field value is represented by two
        consecutive quotechars.

I will continue fixing _csv.c on the assumption that we want to keep this
tweakable parameter.

- Dave

--
http://www.object-craft.com.au

From skip at pobox.com Thu Jan 30 13:54:43 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 30 Jan 2003 06:54:43 -0600
Subject: [Csv] Made some changes to the PEP
In-Reply-To: 
References: 
Message-ID: <15929.8339.826486.231614@montanaro.dyndns.org>

Dave> Here is the commit message:

Dave> Trying to bring PEP up to date with discussions on mailing list...

Much appreciated. I just added a todo section near the top. Anyone can
feel free to add to the list or take care of any items.

Skip

From skip at pobox.com Thu Jan 30 13:57:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 30 Jan 2003 06:57:27 -0600
Subject: [Csv] Cutting out for a bit...
Message-ID: <15929.8503.814261.580267@montanaro.dyndns.org>

As masochistic as it may seem, I am currently working on two PEPs. I'm
going to cut out for awhile to work on PEP 304. I need to make some
progress on that if it's going to have more than a snowball's chance in
hell of making it into 2.3. At some point today it would be good if we
could announce PEP 305 to the world and start to get some feedback from
the unwashed masses.
Skip

From djc at object-craft.com.au Thu Jan 30 14:17:59 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 31 Jan 2003 00:17:59 +1100
Subject: [Csv] Moving _csv.c closer to PEP
In-Reply-To: 
References: 
Message-ID: 

>>>>> "Dave" == Dave Cole writes:

Dave> In the process of fixing _csv.c so it will handle the parameters
Dave> specified in the PEP I came across yet another configurable
Dave> dialect setting.

Dave> doublequote
Dave>     When True, a quotechar in a field value is represented by
Dave>     two consecutive quotechars.

Dave> I will continue fixing _csv.c on the assumption that we want to
Dave> keep this tweakable parameter.

Here is the commit message:

    * More formatting changes to bring code closer to the Guido style.

    * Changed all internal parser settings to match those in the PEP.

    * Added PEP settings to allow _csv use by csv.py - new parameters
      are not handled yet (skipinitialspace, lineterminator, quoting).

    * Removed overloading of quotechar and escapechar values by
      introducing have_quotechar and have_escapechar attributes.

Barest minimum of testing has been done.

- Dave

--
http://www.object-craft.com.au

From djc at object-craft.com.au Thu Jan 30 14:30:10 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 31 Jan 2003 00:30:10 +1100
Subject: [Csv] Made some changes to the PEP
In-Reply-To: <15929.8339.826486.231614@montanaro.dyndns.org>
References: <15929.8339.826486.231614@montanaro.dyndns.org>
Message-ID: 

>>>>> "Skip" == Skip Montanaro writes:

Dave> Here is the commit message:

Dave> Trying to bring PEP up to date with discussions on mailing
Dave> list...

Skip> Much appreciated. I just added a todo section near the top.
Skip> Anyone can feel free to add to the list or take care of any
Skip> items.

From the TODO:

- Need to complete initial list of formatting parameters and settle on
  names.

This is what I have done in the _csv module:

    >>> import _csv
    >>> help(_csv)
    [snip]
    delimiter
        Defines the character that will be used to separate fields in
        the CSV record.
    quotechar
        Defines the character used to quote fields that contain the
        field separator or newlines. If set to None special characters
        will be escaped using the escapechar.

    escapechar
        Defines the character used to escape special characters. Only
        used if quotechar is None.

    doublequote
        When True, quotes in a field must be doubled up.

    skipinitialspace
        When True spaces following the delimiter are ignored.

    lineterminator
        The string used to terminate records.

    quoting
        Controls the generation of quotes around fields when writing
        records. This is only used when quotechar is not None.

    autoclear
        When True, calling parse() will automatically call the clear()
        method if the previous call to parse() raised an exception
        during parsing.

    strict
        When True, the parser will raise an exception on malformed
        fields rather than attempting to guess the right behavior.
    [snip]

Not sure that we need to keep the last two... When the parser fails you
are able to look at the fields it managed to parse before the problem was
encountered. This might be useful for the sniffer. The autoclear parameter
controls whether or not you must manually clear() the partial record
before trying to parse more data.

The strict parameter controls what happens when you see data like this:

    "blah","oops" blah"

If strict is False then the " after the oops is included as part of the
field 'oops" blah'. If strict is True, an exception is raised.

- Dave

--
http://www.object-craft.com.au

From djc at object-craft.com.au Thu Jan 30 15:03:57 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 31 Jan 2003 01:03:57 +1100
Subject: [Csv] Devil in the details, including the small one between
 delimiters and quotechars
In-Reply-To: <15928.4083.834299.369381@montanaro.dyndns.org>
References: <1043859517.16012.14.camel@software1.logiplex.internal>
 <15928.4083.834299.369381@montanaro.dyndns.org>
Message-ID: 

Checking against the current version of the CSV parser.
Cliff> 1, "not quoted","quoted" Cliff> It seems reasonable to parse this as: Cliff> [1, ' "not quoted"', "quoted"] Cliff> which is the described Excel behavior. >>> import _csv >>> p = _csv.parser() >>> p.parse('1, "not quoted","quoted"') ['1', ' "not quoted"', 'quoted'] Looks OK. Cliff> Now consider Cliff> 1,"not quoted" ,"quoted" Cliff> Is the second field quoted or not? If it is, do we discard the Cliff> extraneous whitespace following it or raise an exception? The current version of the _csv parser can do two things depending upon the value of the strict parameter. >>> p.strict 0 >>> p.parse('1,"not quoted" ,"quoted"') ['1', 'not quoted ', 'quoted'] >>> p.strict = 1 >>> p.parse('1,"not quoted" ,"quoted"') Traceback (most recent call last): File "<stdin>", line 1, in ? _csv.Error: , expected after " Skip> Well, there's always the "be flexible in what you accept, Skip> strict in what you generate" school of thought. In the above, Skip> that would suggest the list returned would be Skip> ['1', 'not quoted', 'quoted'] Why wouldn't you include the trailing space on the second field? Andrew, what does Excel do here? Hmm... I was sort of expecting _csv to do this: ['1', 'not quoted" ', 'quoted'] Skip> It seems like a minor formatting glitch. How about a warning? Skip> Or a "strict" flag for the parser? I think that there are enough variations here that strict is not enough. The second one does look a bit bogus... ['1', '"not quoted" ', 'quoted'] ['1', 'not quoted" ', 'quoted'] ['1', 'not quoted ', 'quoted'] Cliff> Worse, consider this Cliff> "quoted", "not quoted, but this ""field"" has delimiters and quotes" Skip> Depends on the setting of skipinitialspaces.
If false, you get Skip> ['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] parser does this: ['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] Skip> if True, I think you get Skip> ['quoted', 'not quoted, but this "field" has delimiters and quotes'] Yeah, but the doublequote stuff is only meant for quoted fields (or is it). Cliff> How should this parse? I say free exceptions for everyone. Don't know if exceptions are what we need. We just need to come up with parameters which control the parser to sufficient detail to handle the dialect variations. Cliff> I propose space between delimiters and quotes raise an exception Cliff> and let's be done with it. I don't think this really affects Cliff> Excel compatibility since Excel will never generate this type of Cliff> file and doesn't require it for import. It's true that some Cliff> files that Excel would import (probably incorrectly) won't import Cliff> in CSV, but I think that's outside the scope of Excel Cliff> compatibility. Skip> Sounds good to me. I dunno. We should look at the corner cases and handle as many as we can in the dialect. That is sort of the whole point of why we are here. Cliff> Anyway, I know no one has said "On your mark, get set" yet, but I Cliff> can't think without code sitting in front of me, breaking worse Cliff> with every keystroke, so in addition to creating some test cases, Cliff> I've hacked up a very preliminary CSV module so we have something Cliff> to play with. I was up til 6am so if there's anything odd, I Cliff> blame it on lack of sleep and the feverish optimism and glossing Cliff> of detail that comes with it. Skip> Perhaps you and Dave were in a race but didn't know it? ;-) When Skip mentioned that we were going to have the speedy Object Craft parser I just checked in the _csv module. It does not handle all of what we have been discussing, but it is close. 
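The two strict behaviours Dave shows map directly onto the csv module that eventually shipped in the standard library; a minimal sketch for comparison (the reader-based spelling below is the later API, not code from this thread):

```python
import csv

# Non-strict (the default): the stray space after the closing quote is
# folded back into the field rather than rejected.
rows = list(csv.reader(['1,"not quoted" ,"quoted"']))
print(rows)  # [['1', 'not quoted ', 'quoted']]

# Strict: the same input raises csv.Error instead.
try:
    list(csv.reader(['1,"not quoted" ,"quoted"'], strict=True))
except csv.Error as exc:
    print('rejected:', exc)
```

So of the three candidate parses Dave lists, the shipped non-strict parser picks the third ('not quoted ' with the trailing space kept).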
- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Thu Jan 30 15:18:50 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 01:18:50 +1100 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: References: Message-ID: >>>>> "Kevin" == Kevin Altis writes: >> Exceptions should be raised for anything else, even None. An empty >> field is "". [snip] >> I really still dislike this whole None thing. Whose use case is >> that anyway? [snip] Kevin> Are you saying that you want to throw an exception instead? Kevin> Booleans may also present a problem. I was mostly thinking in Kevin> terms of importing and exporting data from embedded databases Kevin> like MetaKit, my own list of dictionaries (flatfile stuff), Kevin> PySQLite, Gadfly. Anyway, the implication might be that it is Kevin> necessary for the user to sanitize data as part of the export Kevin> operation too. Have to ponder that. The penny finally dropped!!! The None thing and the implicit __str__ conversion is there in the Object Craft parser to be compatible with the DB-API. 
Consider the following code (which is close to something I had to do a couple of years ago): import csv import Sybase db = Sybase.connect(server, user, passwd, database) c = db.cursor() c.execute('select some stuff from the database') p = csv.parser() fp = open('results.csv', 'w') for row in c.fetchall(): fp.write(p.join(row)) fp.write('\n') We would be doing it slightly better now: import csv import Sybase db = Sybase.connect(server, user, passwd, database) c = db.cursor() c.execute('select some stuff from the database') csvwriter = csv.writer(file('results.csv', 'w')) for row in c.fetchall(): csvwriter.write(row) Or even: import csv import Sybase db = Sybase.connect(server, user, passwd, database) c = db.cursor() c.execute('select some stuff from the database') csvwriter = csv.writer(file('results.csv', 'w')) csvwriter.writelines(c) Now without the implicit __str__ and conversion of None to '' we would require a shirtload of code to do the same thing, only it would be as slow as a slug on valium. - Dave -- http://www.object-craft.com.au From skip at pobox.com Thu Jan 30 15:18:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 08:18:44 -0600 Subject: [Csv] Completely off-topic... In-Reply-To: References: Message-ID: <15929.13380.487790.427756@montanaro.dyndns.org> Saving useful commentary for later... Dave> ...""""I see,""\n', Dave> 'said the blind man","as he picked up his\n', Dave> 'hammer and saw"\n') My father used to use this expression all the time. I have no idea of its origins (though his dad was a carpenter and he started out life as one). He's been dead and gone for over 30 years now so I can't easily ask him. Any time I've used it people always looked at me like I was nuts. This is the first instance where I've actually encountered another person using it.
Skip From djc at object-craft.com.au Thu Jan 30 15:20:27 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 01:20:27 +1100 Subject: [Csv] Re: Completely off-topic... In-Reply-To: <15929.13380.487790.427756@montanaro.dyndns.org> References: <15929.13380.487790.427756@montanaro.dyndns.org> Message-ID: Skip> Saving useful commentary for later... Dave> ...""""I see,""\n', Dave> 'said the blind man","as he picked up his\n', Dave> 'hammer and saw"\n') Skip> My father used to use this expression all the time. I have no Skip> idea of its origins (though his dad was a carpenter and he Skip> started out life as one). He's been dead and gone for over 30 Skip> years now so I can't easily ask him. Any time I've used it Skip> people always looked at me like I was nuts. This is the first Skip> instance where I've actually encountered another person Skip> using it. Your mind is failing... http://www.object-craft.com.au/projects/csv/ :-) - Dave -- http://www.object-craft.com.au From skip at pobox.com Thu Jan 30 15:25:58 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 08:25:58 -0600 Subject: [Csv] Moving _csv.c closer to PEP In-Reply-To: References: Message-ID: <15929.13814.339184.359208@montanaro.dyndns.org> Dave> In the process of fixing _csv.c so it will handle the parameters Dave> specified in the PEP I came across yet another configurable Dave> dialect setting. Dave> doublequote Dave> When True quotechar in a field value is represented by two Dave> consecutive quotechar. Isn't that implied as long as quoting is not "never" and escapechar is None? If so, and we decide to have a separate doublequote parameter anyway, checking that relationship should be part of validating the parameter set. Speaking of doubling things, can the low-level parser support multi-character quotechar or delimiter strings? Recall I mentioned the previous client who didn't quote anything in their private file format and used ::: as the field separator.
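On Skip's ::: question: the parser that eventually shipped answers no; delimiter and quotechar are validated as single characters, so a multi-character separator has to be handled outside the csv machinery. A sketch (the pre-split workaround is my illustration, not from the thread):

```python
import csv

# The shipped parser only accepts 1-character delimiters:
try:
    csv.reader([], delimiter=':::')
except TypeError as exc:
    print('rejected:', exc)

# so a :::-separated format with unquoted fields gets split by hand:
line = 'field one:::field two:::field three'
print(line.split(':::'))  # ['field one', 'field two', 'field three']
```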
Skip From skip at pobox.com Thu Jan 30 15:40:31 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 08:40:31 -0600 Subject: [Csv] Devil in the details, including the small one between delimiters and quotechars In-Reply-To: References: <1043859517.16012.14.camel@software1.logiplex.internal> <15928.4083.834299.369381@montanaro.dyndns.org> Message-ID: <15929.14687.742062.136173@montanaro.dyndns.org> Dave> The current version of the _csv parser can do two things depending Dave> upon the value of the strict parameter. >>> p.strict 0 >>> p.parse('1,"not quoted" ,"quoted"') ['1', 'not quoted ', 'quoted'] Hmmm... I think this is wrong. You treated " as the quote character but tacked the space onto the field even though it occurred after the " which should have terminated the field. I would have expected: ['1', 'not quoted', 'quoted'] Barfing when p.strict == 1 seems correct to me. Skip> ['1', 'not quoted', 'quoted'] Dave> Why wouldn't you include the trailing space on the second field? Because the quoting tells you the field has ended. Dave> I think that there are enough variations here that strict is not Dave> enough. I think that when strict == 0, extra whitespace between the terminating quote and the delimiter or between the delimiter and the first quote should be discarded. If the field is not quoted, leading or trailing whitespace is ignored. I think that makes the treatment of whitespace near delimiters uniform (principle of least surprise?). If that's not what the user wants, she can damn well set the strict flag to True and catch the exception. ;-) (Speaking of exceptions, should there be a field in _csv.Error which holds the raw text which causes the exception?) Skip> Depends on the setting of skipinitialspaces. 
If false, you get Skip> ['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] Dave> parser does this: Dave> ['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] Skip> if True, I think you get Skip> ['quoted', 'not quoted, but this "field" has delimiters and quotes'] Dave> Yeah, but the doublequote stuff is only meant for quoted fields Dave> (or is it). Damn, yeah. Maybe we have overspecified the parameter set. Do we need both strict and skipinitialspaces? I'd say keep strict and dump skipinitialspaces, then define fairly precisely what to do when strict==False. Cliff> I propose space between delimiters and quotes raise an exception Cliff> and let's be done with it. I don't think this really affects Cliff> Excel compatibility since Excel will never generate this type of Cliff> file and doesn't require it for import. It's true that some Cliff> files that Excel would import (probably incorrectly) won't import Cliff> in CSV, but I think that's outside the scope of Excel Cliff> compatibility. Skip> Sounds good to me. I can never remember my past train of thought from one day to the next. :-( can-you-hear-me-waffling?-ly y'rs, Skip From skip at pobox.com Thu Jan 30 15:51:03 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 08:51:03 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: References: Message-ID: <15929.15319.901753.91284@montanaro.dyndns.org> Dave> The None thing and the implicit __str__ conversion is there in the Dave> Object Craft parser to be compatible with the DB-API.... Hmmm... I've used MySQLdb and psycopg and don't recall my queries returning None. (He furiously searches for None in PEP 249...) Ah, I see: SQL NULL values are represented by the Python None singleton on input and output. I generally have always defined my fields to have defaults and usually also declare them NOT NULL, so I wouldn't expect to see None in my query results. 
Still, the current treatment of None doesn't successfully round-trip ("select * ...", dump to csv, load from csv, repopulate database). Do you distinguish an empty field from a quoted field printed as ""? That is, are these output rows different? 5.0,,"Mary, Mary, quite contrary"\r\n 5.0,"","Mary, Mary, quite contrary"\r\n the former parsing into [5.0, None, "Mary, Mary, quite contrary"] and the latter into [5.0, "", "Mary, Mary, quite contrary"] ? Dave> Now without the implicit __str__ and conversion of None to '' we Dave> would require a shirtload of code to do the same thing, only it Dave> would be as slow as a slug on valium. How about we let the user define how to handle None? I would always want None's appearing in my data to raise an exception. You clearly have a use case for automatically mapping to the empty string. Skip From skip at pobox.com Thu Jan 30 15:52:24 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 08:52:24 -0600 Subject: [Csv] Re: Completely off-topic... In-Reply-To: References: <15929.13380.487790.427756@montanaro.dyndns.org> Message-ID: <15929.15400.755945.151484@montanaro.dyndns.org> Skip> This is the first instance where I've actually encountered Skip> another person using it. Dave> Your mind is failing... Dave> http://www.object-craft.com.au/projects/csv/ Dave> :-) I never read the instructions. I just click the "Download" link. ;-) Skip From LogiplexSoftware at earthlink.net Thu Jan 30 18:57:45 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 09:57:45 -0800 Subject: [Csv] Status In-Reply-To: <15928.37531.445243.692589@montanaro.dyndns.org> References: <15928.37531.445243.692589@montanaro.dyndns.org> Message-ID: <1043949465.16012.101.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 18:48, Skip Montanaro wrote: > It would appear we are converging on dialects as data-only classes > (subclassable but with no methods). I'll update the PEP.
Many other ideas > have been floating through the list, and while I haven't been deleting the > messages, I haven't been adding them to the PEP either. Can someone help > with that? A comment on the dialect classes: I think a validate() method would be good in the base dialect class. A separate validate function would do just as well, but it seems logical to make it part of the class. > I'd like to get the wording in the PEP to converge on our current thoughts > and announce it on c.l.py and python-dev sometime tomorrow. I think we will > get a lot of feedback from both camps, hopefully some of it useful. ;-) Undoubtedly Timothy Rue will inform us that we are wasting our time as the VIC will solve this problem as well (after all, input->9 commands->output), but if you think you can live with that, sure. > I just finished making a pass through the messages I hadn't deleted (and > then saved them to a csv mbox file since the list appears to still not be > archiving). Here's what I think we've concluded: > > * Dialects are a set of defaults, probably implemented as classes (which > allows subclassing, whereas dicts wouldn't) and the default dialect > named as something like csv.dialects.excel or "excel" if we allow > string specifiers. (I think strings work well at the API, simply > because they are shorter and can more easily be presented in GUI > tools.) Agreed. Just to clarify, these strings will still be stored in a dictionary ("settings" or "dialects")?
> * A csvutils module should be at least scoped out which might do a fair > number of things: > > - Implements one or more sniffers for parameter types > > - Validates CSV files (e.g., constant number of columns, type > constraints on column values, compares against given dialect) > > - Generate a sniffer from a CSV file > > * These individual parameters are necessary (hopefully the names will be > enough clue as to their meaning): quote_char, quoting ("auto", > "always", "nonnumeric", "never"), delimiter, line_terminator, > skip_whitespace, escape_char, hard_return. Are there others? > > * We're still undecided about None (I certainly don't think it's a valid > value to be writing to CSV files) IMO, None should be mapped to '', so [None, None, None] would be saved as ,, or "","","" if quoting="always". I can't think of any reasonable alternative. However, it is arguable whether reading ,, should return [None,None,None] or ['','','']. I'd vote for the latter since we explicitly are not doing conversions between strings and Python types ('6' doesn't become 6). > * Rows can have variable numbers of columns and the application is > responsible for deciding on and enforcing max_rows or max_cols. > > * Don't raise exceptions needlessly. For example, specifying > quoting="never" and not specifying a value for escape_char would be > okay until you encounter a field when writing which contains the > delimiter. > > * Files have to be opened in binary mode (we can check the mode > attribute I believe) so we can do the right thing with line > terminators. > > * Data values should always be returned as strings, even if they are > valid numbers. Let the application do data conversion. > > Other stuff we haven't talked about much: > > * Unicode. I think we punt on this for now and just pretend that > passing codecs.open(csvfile, mode, encoding) is sufficient. I'm sure > Martin von Löwis will let us know if it isn't.
;-) Dave said, "The low > level parser (C code) is probably going to need to handle unicode." > Let's wait and see how well codecs.open() works for us. > > * We know we need tests but haven't talked much about them. I vote for > PyUnit as much as possible, though a certain amount of manual testing > using existing spreadsheets and databases will be required. +1. Testing all the corner cases is going to take some care. > * Exceptions. We know we need some. We should start with CSVError and > try to avoid getting carried away with things. If need be, we can add > a code field to the class. I don't like the idea of having 17 > different subclasses of CSVError though. It's too much complexity for > most users. I can only count to 12 (or was it 11?), so this would be good for me as well. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Thu Jan 30 20:58:25 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 11:58:25 -0800 Subject: [Csv] Module question... In-Reply-To: <20030130073352.A55953C32B@coffee.object-craft.com.au> References: <20030130073352.A55953C32B@coffee.object-craft.com.au> Message-ID: <1043956705.16012.112.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 23:33, Andrew McNamara wrote: > >> What was the rational for using files, rather making the user do their > >> own readline(), etc? > > > >I'll try and summarize, if this is too simplistic or incorrect I'm sure > >someone will speak up :) > >One other possibility would be for the parser to only deal with one row at a > >time, leaving it up to the user code to feed the parser the row strings. But > >given the various possible line endings for a row of data and the fact that > >a column of a row may contain a line ending, not to mention all the other > >escape character issues we've discussed, this would be error-prone. 
> > This is the way the Object Craft module has worked - it works well enough, > and the universal end-of-line stuff in 2.3 makes it more seamless. Not > saying I'm wedded to this scheme, but I'd just like to have clear why > we've chosen one over the other. It simplifies use for the programmer not to have to feed one line at a time to the parser. If the programmer needs to generate data one line at a time, they can pass a pipe and feed data into that. > I'm trying to think of an example where operating on a file-like object > would be too restricting, and I can't - oh, here's one: what if you > wanted to do some pre-processing on the data (say it was uuencoded)? Then they can uudecode it, write it to a temp file and pass that instead of the original. I think the file-like object is the best compromise between ease-of-use and flexibility. > >The solution was to simply accept a file-like object and let the parser do > >the interpretation of a record. By having the parser present an iterable > >interface, the user code still gets the convenience of processing per row if > >needed or if no processing is desired a result list can easily be obtained. > > > >This should provide the most flexibility while still being easy to use. Hey, that's what I was thinking > Should the object just be defined as an iteratable, and leave closing, > etc, up to the user of the module? One downside of this is you can't > rewind an iterator, so things like the sniffer would be SOL. We can't > ensure that the passed file is rewindable either. Hmmm. -1. If it isn't sniffable, I'd end up having to write another CSV parser to support the features DSV currently has. 
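The rewind problem Andrew and Cliff are circling is why the sniffer that eventually shipped works on a string sample rather than on the iterator itself: the caller reads a bounded sample, sniffs it, then seeks back before parsing. A sketch against the later csv.Sniffer API (sample data invented for illustration):

```python
import csv
import io

data = 'name;qty;price\r\nwidget;2;5.66\r\n'
f = io.StringIO(data)

# Sniff a bounded sample, then rewind so the reader sees every row.
dialect = csv.Sniffer().sniff(f.read(1024))
f.seek(0)

print(dialect.delimiter)             # ;
print(list(csv.reader(f, dialect)))
```

This keeps the sniffer out of the parser's way entirely, at the cost of requiring a seekable source, which is exactly the trade-off under discussion.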
-- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Thu Jan 30 21:22:14 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 12:22:14 -0800 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <15929.15319.901753.91284@montanaro.dyndns.org> References: <15929.15319.901753.91284@montanaro.dyndns.org> Message-ID: <1043958134.15753.132.camel@software1.logiplex.internal> On Thu, 2003-01-30 at 06:51, Skip Montanaro wrote: > Dave> The None thing and the implicit __str__ conversion is there in the > Dave> Object Craft parser to be compatible with the DB-API.... > > Hmmm... I've used MySQLdb and psycopg and don't recall my queries returning > None. (He furiously searches for None in PEP 249...) Ah, I see: > > SQL NULL values are represented by the Python None singleton on input > and output. > > I generally have always defined my fields to have defaults and usually also > declare them NOT NULL, so I wouldn't expect to see None in my query results. > > Still, the current treatment of None doesn't successfully round-trip > ("select * ...", dump to csv, load from csv, repopulate database). Do you > distinguish an empty field from a quoted field printed as ""? That is, are > these output rows different? > > 5.0,,"Mary, Mary, quite contrary"\r\n > 5.0,"","Mary, Mary, quite contrary"\r\n > > the former parsing into > > [5.0, None, "Mary, Mary, quite contrary"] > > and the latter into > > [5.0, "", "Mary, Mary, quite contrary"] I'd suggest *not* mapping anything to any object but a string on *import*. CSV files don't have any way of carrying type information (except perhaps on an application-by-application basis, but I don't think that's where we're going here) so it's best to treat *everything* as a string. Export is a slightly different story. 
I do think None should be mapped to '' on export since that is the only reasonable value for it, and there are enough existing modules that use None to represent an empty value that this would be a reasonable thing for us to handle. > > Dave> Now without the implicit __str__ and conversion of None to '' we > Dave> would require a shirtload of code to do the same thing, only it > Dave> would be as slow as a slug on valium. > > How about we let the user define how to handle None? I would always want > None's appearing in my data to raise and exception. You clearly have a use > case for automatically mapping to the empty string. This might not affect performance too badly if we *always* raise an exception when passed anything but a string, and do the conversion (which would involve a table lookup) in the exception handler. Anything not in the table would cause the exception to be passed up to the caller. That being said, this might complicate things too much for many people. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Thu Jan 30 21:10:10 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 12:10:10 -0800 Subject: [Csv] Access Products sample In-Reply-To: References: Message-ID: <1043957410.16012.122.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 22:33, Kevin Altis wrote: > I created a db and table in Access (products.mdb) using one of the built-in > samples. I created two rows, one that is mostly empty. I used the default > CSV export to create(Products.csv) and also output the table as an Excel > 97/2000 XLS file (Products.xls). Finally, I had Excel export as CSV > (ProductsExcel.csv). They are all contained in the attached zip. > > The currency column in the table is actually written out with formatting > ($5.66 instead of just 5.66). 
Note that when Excel exports this column it > has a trailing space for some reason (,$5.66 ,). So we've actually found an application that puts an extraneous space around the data, and it's our primary target. Figures. > While exporting it reminded me that unless a column in the data set contains > an embedded newline or carriage return it shouldn't matter whether the file > is opened in binary mode for reading. > > Without a schema we don't know what each column is supposed to contain, so > that is outside the domain of the csv import parser and export writer. Agreed. > The values exported by both Access and Excel are designed to prevent > information loss within the constraints of the CSV format, thus a field with > no value (what I think of as None in Python) is empty in the CSV Something just occurred to me: say someone is controlling Excel via win32com and obtains their data that way. Do the empty cells in that list appear as '' or None? If they do appear as None, then I'd be inclined to again raise the argument that we should map None => '' on export. Unless, of course, someone else has an idea they want to trade +1 votes on again > Should we be able to import and then export using a given dialect, such > that there would be no differences between the original csv and the exported > one? Actually, using the Access default of quoting strings it isn't possible > to do that because it implies having a schema to know that a given column is > a string. With the Excel csv format it is possible because a column that > doesn't contain a comma won't be quoted. I don't think that we need to worry about whether checksum(original) == checksum(output) to claim compatibility, only that we can read and write files compatible with said application.
If they turn out to be identical, that's just a side-effect ;) -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Thu Jan 30 21:23:27 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 12:23:27 -0800 Subject: [Csv] CSV interface question In-Reply-To: <15928.34442.337899.905054@montanaro.dyndns.org> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> <1043867895.16012.87.camel@software1.logiplex.internal> <15928.34442.337899.905054@montanaro.dyndns.org> Message-ID: <1043958206.15753.134.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 17:57, Skip Montanaro wrote: > Cliff> Consider now the programmer actually defining a new dialect: > Cliff> Passing a class or other structure (a dict is fine), they can > Cliff> create this on the fly with minimal work. Using a *string*, they > Cliff> must first "register" that string somewhere (probably in the > Cliff> mapping we agree upon) before they can actually make the function > Cliff> call. Granted, it's only a an extra step, but it requires a bit > Cliff> more knowledge (of the mapping) and doesn't seem to provide a > Cliff> real benefit. If you prefer a mapping to a class, that is fine, > Cliff> but lets pass the mapping rather than a string referring to it: > > Somewhere I think we still need to associate string names with these > beasts. Maybe it's just another attribute: > > class dialect: > name = None > > class excel(dialect): > name = "excel" > ... > > They should all be collected together for operation as a group. 
This could > be so a GUI knows all the names to present or so a sniffer can return all > the dialects with which a sample file is compatible. Both operations > suggest the need to register dialects somehow. +1 on this. Hm. If I keep trying I might get you to agree with everything just out of exhaustion -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Thu Jan 30 21:45:26 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 14:45:26 -0600 Subject: [Csv] Module question... In-Reply-To: <1043956705.16012.112.camel@software1.logiplex.internal> References: <20030130073352.A55953C32B@coffee.object-craft.com.au> <1043956705.16012.112.camel@software1.logiplex.internal> Message-ID: <15929.36582.100675.643804@montanaro.dyndns.org> Cliff> -1. If it isn't sniffable, I'd end up having to write another Cliff> CSV parser to support the features DSV currently has. Or approach the problem differently? Try asking the low-level parser to return a few rows of the file using different parameters. The low-level parser is fast enough that you can (given a filename) attempt to parse it many times in fairly short order. See what works. ;-) Skip From skip at pobox.com Thu Jan 30 22:02:27 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 15:02:27 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <1043958134.15753.132.camel@software1.logiplex.internal> References: <15929.15319.901753.91284@montanaro.dyndns.org> <1043958134.15753.132.camel@software1.logiplex.internal> Message-ID: <15929.37603.217278.623650@montanaro.dyndns.org> Cliff> Export is a slightly different story. I do think None should be Cliff> mapped to '' on export since that is the only reasonable value Cliff> for it, and there are enough existing modules that use None to Cliff> represent an empty value that this would be a reasonable thing Cliff> for us to handle. 
How is a database (that was Dave's use case) supposed to distinguish '' as SQL NULL vs '' as an empty string though? This is the sort of thing that bothers me about mapping None to ''. Cliff> This might not affect performance too badly if we *always* raise Cliff> an exception when passed anything but a string, ... except float and int values will be prevalent in the data. Can we limit the data to float, int, plain strings, Unicode and None? If so, I think you can just test the object types and do the right thing. In the case of None, I'd like to see a parameter which would allow me to flag that as an error. The extra complication might be limited to map_none_to='some string, possibly empty' in the writer() constructor and interpret_empty_string_as= in the reader() constructor. Skip From skip at pobox.com Thu Jan 30 22:03:51 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 15:03:51 -0600 Subject: [Csv] Access Products sample In-Reply-To: <1043957410.16012.122.camel@software1.logiplex.internal> References: <1043957410.16012.122.camel@software1.logiplex.internal> Message-ID: <15929.37687.44696.305338@montanaro.dyndns.org> >> The currency column in the table is actually written out with >> formatting ($5.66 instead of just 5.66). Note that when Excel exports >> this column it has a trailing space for some reason (,$5.66 ,). Cliff> So we've actually found an application that puts an extraneous Cliff> space around the data, and it's our primary target. Figures. So we just discovered we need an "access" dialect. ;-) Skip From LogiplexSoftware at earthlink.net Thu Jan 30 22:10:28 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 13:10:28 -0800 Subject: [Csv] Module question... 
In-Reply-To: <15929.36582.100675.643804@montanaro.dyndns.org> References: <20030130073352.A55953C32B@coffee.object-craft.com.au> <1043956705.16012.112.camel@software1.logiplex.internal> <15929.36582.100675.643804@montanaro.dyndns.org> Message-ID: <1043961028.15753.148.camel@software1.logiplex.internal> On Thu, 2003-01-30 at 12:45, Skip Montanaro wrote: > Cliff> -1. If it isn't sniffable, I'd end up having to write another > Cliff> CSV parser to support the features DSV currently has. > > Or approach the problem differently? Try asking the low-level parser to > return a few rows of the file using different parameters. The low-level > parser is fast enough that you can (given a filename) attempt to parse it > many times in fairly short order. See what works. ;-) I'm not sure that would be a good approach, as passing incorrect arguments to the parser might cause problems (it *is* written in C) and given the number of possible variations, it would be inefficient no matter how fast the parser. However, it is certainly possible to sniff the file prior to passing it to the parser. I suppose there is no reason the sniffer has to take the same type of file (or iterator) argument the parser does, although it would be nice, for consistency. Okay: -0 to whatever someone said that I was arguing about.
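[Editor's note: Cliff's sniff-before-parse idea, and Skip's try-everything variant of it, can be sketched in a few lines of pure Python. Everything below is illustrative rather than the module's API: the function name, the candidate list, and the consistent-column-count scoring are invented for the example, and the csv module that eventually shipped stands in for the low-level parser under discussion.]

```python
import csv

def sniff_delimiter(sample_lines, candidates=',;\t|'):
    # Try each candidate delimiter on the sample rows and keep the one
    # that yields a consistent field count greater than one.
    best, best_count = ',', 1
    for delim in candidates:
        rows = list(csv.reader(sample_lines, delimiter=delim))
        counts = set(len(row) for row in rows)
        if len(counts) == 1:
            count = counts.pop()
            if count > best_count:
                best, best_count = delim, count
    return best

print(sniff_delimiter(['a;b;c', '1;2;3']))  # prints ;
```

A real sniffer would probe the quote character and initial-space handling the same way; the shape of the try-and-score loop is the point.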
-- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Thu Jan 30 22:45:53 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 13:45:53 -0800 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <15929.37603.217278.623650@montanaro.dyndns.org> References: <15929.15319.901753.91284@montanaro.dyndns.org> <1043958134.15753.132.camel@software1.logiplex.internal> <15929.37603.217278.623650@montanaro.dyndns.org> Message-ID: <1043963153.16012.159.camel@software1.logiplex.internal> On Thu, 2003-01-30 at 13:02, Skip Montanaro wrote: > Cliff> Export is a slightly different story. I do think None should be > Cliff> mapped to '' on export since that is the only reasonable value > Cliff> for it, and there are enough existing modules that use None to > Cliff> represent an empty value that this would be a reasonable thing > Cliff> for us to handle. > > How is a database (that was Dave's use case) supposed to distinguish '' as > SQL NULL vs '' as an empty string though? This is the sort of thing that > bothers me about mapping None to ''. The database not being able to distinguish '' from SQL NULL is inherent in the file format. CSV files have no concept of '' vs None vs NULL. There is only ,, or ,"", which I think should be considered the same (because the same data [or lack of] can be expressed either way by tweaking the quote settings). If we don't want them to be considered the same, then we need YAO to specify whether to interpret them differently. > > Cliff> This might not affect performance too badly if we *always* raise > Cliff> an exception when passed anything but a string, ... > > except float and int values will be prevalent in the data. Well, right =) > Can we limit the data to float, int, plain strings, Unicode and None? If > so, I think you can just test the object types and do the right thing. 
In > the case of None, I'd like to see a parameter which would allow me to flag > that as an error. The extra complication might be limited to > > map_none_to='some string, possibly empty' This seems reasonable. > in the writer() constructor and > > interpret_empty_string_as= > > in the reader() constructor. Okay. > Skip Sure. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Thu Jan 30 23:34:32 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 14:34:32 -0800 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <1043963153.16012.159.camel@software1.logiplex.internal> References: <15929.15319.901753.91284@montanaro.dyndns.org> <1043958134.15753.132.camel@software1.logiplex.internal> <15929.37603.217278.623650@montanaro.dyndns.org> <1043963153.16012.159.camel@software1.logiplex.internal> Message-ID: <1043966071.15753.177.camel@software1.logiplex.internal> On Thu, 2003-01-30 at 13:45, Cliff Wells wrote: > On Thu, 2003-01-30 at 13:02, Skip Montanaro wrote: > > Cliff> Export is a slightly different story. I do think None should be > > Cliff> mapped to '' on export since that is the only reasonable value > > Cliff> for it, and there are enough existing modules that use None to > > Cliff> represent an empty value that this would be a reasonable thing > > Cliff> for us to handle. > > > > How is a database (that was Dave's use case) supposed to distinguish '' as > > SQL NULL vs '' as an empty string though? This is the sort of thing that > > bothers me about mapping None to ''. > > The database not being able to distinguish '' from SQL NULL is inherent > in the file format. CSV files have no concept of '' vs None vs NULL. > There is only ,, or ,"", which I think should be considered the same > (because the same data [or lack of] can be expressed either way by > tweaking the quote settings). 
> > If we don't want them to be considered the same, then we need YAO to > specify whether to interpret them differently. Hm. Something has occurred to me. How about treating None as a true null value. That is, we never quote it. So, even if alwaysquote == true [1,2,3,'',None] would get exported as '1','2','3','', That way the difference between the two is saved in the CSV file. Obviously not all programs would be able to take advantage of this implicit information, but it seems likely some would (does Excel differentiate between an empty string and a null value? It wouldn't surprise me to discover that the '' becomes an empty *character* cell and the null value is simply ignored). Clearly this behavior is not desirable in all circumstances. However, the workaround in any case is to not have None values in the data to be exported. This punts any possible issues with it back into user-space. The only problem I have with this is that the behavior is sort of implicit. It saves us a couple of options but it puts the settings in the data, which I am not sure is a good idea. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From andrewm at object-craft.com.au Fri Jan 31 00:19:37 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 10:19:37 +1100 Subject: [Csv] Devil in the details, including the small one between delimiters and quotechars In-Reply-To: Message from Dave Cole References: <1043859517.16012.14.camel@software1.logiplex.internal> <15928.4083.834299.369381@montanaro.dyndns.org> Message-ID: <20030130231937.C26B83C32B@coffee.object-craft.com.au> >Cliff> 1,"not quoted" ,"quoted" > >Why wouldn't you include the trailing space on the second field? > >Andrew, what does Excel do here? Excel returns the trailing space, and honours the quote: ['1', 'not quoted ', 'quoted'] I've checked that it does this consistently (at end of line, etc). >Hmm...
I was sort of expecting _csv to do this: > >['1', 'not quoted" ', 'quoted'] That would have been something I fixed when doing the extensive Excel comparison - it's one of the tests. >Cliff> Worse, consider this >Cliff> "quoted", "not quoted, but this ""field"" has delimiters and quotes" > >Skip> Depends on the setting of skipinitialspaces. If false, you get >Skip> ['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] > >parser does this: > >['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] If we implement the "leading whitespace strip" then it would return: ['quoted', 'not quoted, but this "field" has delimiters and quotes'] -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Fri Jan 31 00:23:05 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 10:23:05 +1100 Subject: [Csv] Weird dialects? Message-ID: <20030130232305.7D1253C32B@coffee.object-craft.com.au> Something that occurred to me last night - we might find that there are strange dialects that we can't easily parse with the C parser (without making it ugly). It occurred to me that maybe the dialect should contain some sort of specification of the parser to use. But my feeling is that if it's too hard to parse with the C parser, it isn't a CSV file, and it should therefore be someone else's problem. Agreed? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From LogiplexSoftware at earthlink.net Fri Jan 31 00:33:09 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 15:33:09 -0800 Subject: [Csv] Weird dialects?
In-Reply-To: <20030130232305.7D1253C32B@coffee.object-craft.com.au> References: <20030130232305.7D1253C32B@coffee.object-craft.com.au> Message-ID: <1043969589.15753.181.camel@software1.logiplex.internal> On Thu, 2003-01-30 at 15:23, Andrew McNamara wrote: > Something that occurred to me last night - we might find that there are > strange dialects that we can't easily parse with the C parser (without > making it ugly). It occurred to me that maybe the dialect should contain > some sort of specification of the parser to use. But my feeling is that > if it's too hard to parse with the C parser, it isn't a CSV file, and > it should therefore be someone else's problem. Agreed? Now there's a concrete definition of CSV -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From andrewm at object-craft.com.au Fri Jan 31 00:35:01 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 10:35:01 +1100 Subject: [Csv] Devil in the details, including the small one between delimiters and quotechars In-Reply-To: Message from Skip Montanaro <15929.14687.742062.136173@montanaro.dyndns.org> References: <1043859517.16012.14.camel@software1.logiplex.internal> <15928.4083.834299.369381@montanaro.dyndns.org> <15929.14687.742062.136173@montanaro.dyndns.org> Message-ID: <20030130233501.6DD5C3C32B@coffee.object-craft.com.au> > >>> p.parse('1,"not quoted" ,"quoted"') > ['1', 'not quoted ', 'quoted'] > >Hmmm... I think this is wrong. You treated " as the quote character but >tacked the space onto the field even though it occurred after the " which >should have terminated the field. I would have expected: "Wrong" it might be, but that's what Excel does... >Damn, yeah. Maybe we have overspecified the parameter set. Do we need both >strict and skipinitialspaces? I'd say keep strict and dump >skipinitialspaces, then define fairly precisely what to do when >strict==False.
I'd go for fine grained in the back end module - remember we have the "dialects" stuff to hide the complexity from the average user. If anything, strict should be broken up so a given flag only enables one feature. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Fri Jan 31 00:39:14 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 10:39:14 +1100 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: Message from Skip Montanaro <15929.15319.901753.91284@montanaro.dyndns.org> References: <15929.15319.901753.91284@montanaro.dyndns.org> Message-ID: <20030130233914.A854F3C32B@coffee.object-craft.com.au> >How about we let the user define how to handle None? I would always want >None's appearing in my data to raise an exception. You clearly have a use >case for automatically mapping to the empty string. Maybe just add an "allow_none" flag - if false, it raises an exception on None; if true, it emits a null string? Sure it doesn't survive the round trip - if you care, you probably should post/pre process data. We can't be all things to all people. As mentioned earlier - True and False are also potentially a problem. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Fri Jan 31 00:43:37 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 10:43:37 +1100 Subject: [Csv] Status In-Reply-To: Message from Cliff Wells <1043949465.16012.101.camel@software1.logiplex.internal> References: <15928.37531.445243.692589@montanaro.dyndns.org> <1043949465.16012.101.camel@software1.logiplex.internal> Message-ID: <20030130234337.230973C32B@coffee.object-craft.com.au> >A comment on the dialect classes: I think a validate() method would be >good in the base dialect class. A separate validate function would do >just as well, but it seems logical to make it part of the class.
The underlying C module currently validates all the options and will raise an exception if an unknown option is set, etc. Should we change this - I'd hate to duplicate the tests? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Fri Jan 31 00:57:04 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 10:57:04 +1100 Subject: [Csv] Devil in the details, including the small one between delimiters and quotechars In-Reply-To: <20030130233501.6DD5C3C32B@coffee.object-craft.com.au> References: <1043859517.16012.14.camel@software1.logiplex.internal> <15928.4083.834299.369381@montanaro.dyndns.org> <15929.14687.742062.136173@montanaro.dyndns.org> <20030130233501.6DD5C3C32B@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >> >>> p.parse('1,"not quoted" ,"quoted"') ['1', 'not quoted ', >> 'quoted'] >> >> Hmmm... I think this is wrong. You treated " as the quote >> character but tacked the space onto the field even though it >> occurred after the " which should have terminated the field. I >> would have expected: Andrew> "Wrong" it might be, but that's what Excel does... I thought so. How are we going to go about building up some dialect test cases? >> Damn, yeah. Maybe we have overspecified the parameter set. Do we >> need both strict and skipinitialspaces? I'd say keep strict and >> dump skipinitialspaces, then define fairly precisely what to do >> when strict==False. Andrew> I'd go for fine grained in the back end module - remember we Andrew> have the "dialects" stuff to hide the complexity from the Andrew> average user. Andrew> If anything, strict should be broken up so a given flag only Andrew> enables one feature. +1 I agree with that. Until we have a few dialects and a test suite we should hold off on trying to lock down all of the parameters. That would be placing the cart before the horse in my opinion. 
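[Editor's note: Dave's question about building up dialect test cases suggests a table-driven layout: one row per (input line, reader options, expected fields). The sketch below is illustrative - the `CASES` table and `run_cases` helper are invented names - but the option names come from this discussion and match the csv module as it eventually shipped, and the cases themselves are behaviors debated in this thread.]

```python
import csv

# (input line, reader options, expected fields)
CASES = [
    # Excel keeps text after a closing quote, including the space:
    ('1,"not quoted" ,"quoted"', {}, ['1', 'not quoted ', 'quoted']),
    # doubled quote characters collapse to one inside a quoted field:
    ('"he said ""hi"""', {}, ['he said "hi"']),
    # skipinitialspace eats whitespace right after the delimiter:
    ('a, b, c', {'skipinitialspace': True}, ['a', 'b', 'c']),
]

def run_cases(cases):
    # Parse each one-line sample and collect any mismatches.
    failures = []
    for line, options, expected in cases:
        got = next(csv.reader([line], **options))
        if got != expected:
            failures.append((line, options, got, expected))
    return failures

assert run_cases(CASES) == []
```

Collecting dialects then becomes a matter of growing the table, one block of cases per dialect and per back-end parser.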
- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Fri Jan 31 01:04:24 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 11:04:24 +1100 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <15929.15319.901753.91284@montanaro.dyndns.org> References: <15929.15319.901753.91284@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Dave> Now without the implicit __str__ and conversion of None to '' we Dave> would require a shirtload of code to do the same thing, only it Dave> would be as slow as a slug on valium. Skip> How about we let the user define how to handle None? I would Skip> always want None's appearing in my data to raise an exception. Skip> You clearly have a use case for automatically mapping to the Skip> empty string. I suspect that programs which combine the DB-API and CSV files are probably quite common. I agree that the round trip fails, but not all of those programs need to make the round trip. What the current behaviour does is "solve" the following: DB-API -> CSV I think you would find it hard to come up with a meaningful way to handle NULL columns for any variant of CSV -> DB-API Regardless of the source of the CSV. The only thing I can think of which makes even partial sense is the following field translation (for CSV -> DB-API): null -> None "null" -> "null" Does that mean that we should have an option on the reader/writer which provides this functionality? I don't know. I would probably use it if it were there.
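[Editor's note: Dave's null/None translation only works when both ends of the round trip agree on it. A toy illustration of the scheme - the function names are invented here, this is not the module's API, and a real implementation would fold it into the writer/reader options:]

```python
def encode_field(value):
    # Writer side of Dave's proposal: None becomes the bare token null,
    # while a genuine 'null' string is quoted so the two can be told
    # apart on the way back in.
    if value is None:
        return 'null'
    text = str(value)
    if text == 'null' or '"' in text or ',' in text:
        return '"%s"' % text.replace('"', '""')
    return text

def decode_field(token):
    # Reader side: the bare token null maps back to None; quoted
    # tokens just lose their quoting.
    if token == 'null':
        return None
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        return token[1:-1].replace('""', '"')
    return token
```

With this convention, `encode_field(None)` yields the unquoted token `null` and `encode_field('null')` yields `"null"`, so `decode_field` can recover None versus the literal string unambiguously - but only for files this pair of functions produced.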
- Dave -- http://www.object-craft.com.au From skip at pobox.com Fri Jan 31 01:10:13 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 18:10:13 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <1043966071.15753.177.camel@software1.logiplex.internal> References: <15929.15319.901753.91284@montanaro.dyndns.org> <1043958134.15753.132.camel@software1.logiplex.internal> <15929.37603.217278.623650@montanaro.dyndns.org> <1043963153.16012.159.camel@software1.logiplex.internal> <1043966071.15753.177.camel@software1.logiplex.internal> Message-ID: <15929.48869.249366.775005@montanaro.dyndns.org> Cliff> Hm. Something has occurred to me. How about treating None as a Cliff> true null value. That is, we never quote it. So, even if Cliff> alwaysquote == true Cliff> [1,2,3,'',None] Cliff> would get exported as Cliff> '1','2','3','', Too fragile, methinks. Also, as I've said before, in my application domain at least, trying to write None to a CSV file is a bug. Skip From djc at object-craft.com.au Fri Jan 31 01:12:31 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 11:12:31 +1100 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <15929.37603.217278.623650@montanaro.dyndns.org> References: <15929.15319.901753.91284@montanaro.dyndns.org> <1043958134.15753.132.camel@software1.logiplex.internal> <15929.37603.217278.623650@montanaro.dyndns.org> Message-ID: Cliff> Export is a slightly different story. I do think None should be Cliff> mapped to '' on export since that is the only reasonable value Cliff> for it, and there are enough existing modules that use None to Cliff> represent an empty value that this would be a reasonable thing Cliff> for us to handle. Skip> How is a database (that was Dave's use case) supposed to Skip> distinguish '' as SQL NULL vs '' as an empty string though? Skip> This is the sort of thing that bothers me about mapping None to Skip> ''.
Cliff> This might not affect performance too badly if we *always* raise Cliff> an exception when passed anything but a string, ... Skip> except float and int values will be prevalent in the data. Skip> Can we limit the data to float, int, plain strings, Unicode and Skip> None? If so, I think you can just test the object types and do Skip> the right thing. In the case of None, I'd like to see a Skip> parameter which would allow me to flag that as an error. The Skip> extra complication might be limited to Skip> Skip> map_none_to='some string, possibly empty' Skip> Skip> in the writer() constructor and Skip> Skip> interpret_empty_string_as= Skip> Skip> in the reader() constructor. I think that we should have an option (or set of options) which causes the following: * In the writer, export None as the unquoted string 'null'. * In the writer, export the string 'null' as the quoted string "null". * In the reader, import the unquoted string 'null' as None. * In the reader, import the quoted string "null" as 'null'. This solves the ambiguity for the case when we are in control of the round trip. When we are not in control of the round trip all bets are off anyway since there is no standard (that I know of) for expressing this. - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Fri Jan 31 01:16:00 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 11:16:00 +1100 Subject: [Csv] Status In-Reply-To: Message from Andrew McNamara <20030130035141.271EA3C32B@coffee.object-craft.com.au> References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org> <20030130035141.271EA3C32B@coffee.object-craft.com.au> Message-ID: <20030131001600.C81D83C32B@coffee.object-craft.com.au> >>I can live with that. I would propose then that escape_char default to >>something reasonable, not None.
> That's a little hairy, because the resulting file can't be parsed >correctly by Excel. But it should be safe if the escape_char is only >emitted if quote is set to none. Hmmm - I just realised this isn't safe where the excel dialect is concerned - excel does no special processing of backslash, so our parser shouldn't either. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Fri Jan 31 01:19:34 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 11:19:34 +1100 Subject: [Csv] Weird dialects? In-Reply-To: <20030130232305.7D1253C32B@coffee.object-craft.com.au> References: <20030130232305.7D1253C32B@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: Andrew> Something that occurred to me last night - we might find that Andrew> there are strange dialects that we can't easily parse with the Andrew> C parser (without making it ugly). It occurred to me that maybe Andrew> the dialect should contain some sort of specification of the Andrew> parser to use. But my feeling is that if it's too hard to Andrew> parse with the C parser, it isn't a CSV file, and it should Andrew> therefore be someone else's problem. Agreed? Why not allow the parser factory function to be an optional argument to the reader and writer factory functions?

    class csvreader:
        def __init__(self, fileobj, dialect='excel2000', parser=_csv.parser,
                     **options):
            :
            self.parser = parser(**parser_options)

This would allow pluggable parsers.
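[Editor's note: Dave's factory-argument idea can be exercised with a stand-in back end. Everything below is hypothetical - the real _csv parser has a different construction protocol, and `Reader`/`SplitParser` are names invented for this sketch - it only demonstrates the pluggable-parser pattern:]

```python
class SplitParser:
    # Trivial stand-in back end: naive split on a delimiter, no quoting.
    def __init__(self, delimiter=','):
        self.delimiter = delimiter

    def parse(self, line):
        return line.split(self.delimiter)

class Reader:
    # The reader takes a parser *factory* and builds the back end from
    # the remaining keyword options, as in Dave's csvreader sketch.
    def __init__(self, lines, dialect='excel', parser=SplitParser, **options):
        self.lines = iter(lines)
        self.parser = parser(**options)

    def __iter__(self):
        for line in self.lines:
            yield self.parser.parse(line.rstrip('\n'))

rows = list(Reader(['a:b:c\n', '1:2:3\n'], parser=SplitParser, delimiter=':'))
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```

Swapping in a different back end is then just `Reader(f, parser=OtherParser, ...)`, which is the flexibility Andrew suspects is more than the module needs.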
- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Fri Jan 31 01:21:38 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 11:21:38 +1100 Subject: [Csv] Status In-Reply-To: <20030131001600.C81D83C32B@coffee.object-craft.com.au> References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org> <20030130035141.271EA3C32B@coffee.object-craft.com.au> <20030131001600.C81D83C32B@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >>> I can live with that. I would propose then that escape_char >>> default to something reasonable, not None. >> That's a little hairy, because the resulting file can't be parsed >> correctly by Excel. But it should be safe if the escape_char is >> only emitted if quote is set to none. Andrew> Hmmm - I just realised this isn't safe where the excel dialect Andrew> is concerned - excel does no special processing of backslash, Andrew> so our parser shouldn't either. That is why for the 'excel2000' dialect you set the escapechar to None. Excel has no escapechar so we do not set one in the parser. Am I missing something? - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Fri Jan 31 01:25:57 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 11:25:57 +1100 Subject: [Csv] Weird dialects? In-Reply-To: Message from Dave Cole References: <20030130232305.7D1253C32B@coffee.object-craft.com.au> Message-ID: <20030131002557.8ABCB3C32B@coffee.object-craft.com.au> >Andrew> Something that occured to me last night - we might find that >Andrew> there are strange dialects that we can't easily parse with the >Andrew> C parser (without make it ugly). It occured to me that maybe >Andrew> the dialect should contain some sort of specification of the >Andrew> parser to use. 
But my feeling is that if it's too hard to >Andrew> parse with the C parser, it isn't a CSV file, and it should >Andrew> therefore be someone else's problem. Agreed? > >Why not allow the parser factory function to be an optional argument >to the reader and writer factory functions? > > class csvreader: > def __init__(self, fileobj, dialect='excel2000', parser=_csv.parser, > **options): > : > self.parser = parser(**parser_options) > >This would allow pluggable parsers. Well, that's essentially what I was suggesting, but I suspect it's too much flexibility - we're not trying to build a general purpose parser framework. And on further thought, this is something that can be addressed later, if need be. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Fri Jan 31 01:28:08 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 11:28:08 +1100 Subject: [Csv] Moving _csv.c closer to PEP In-Reply-To: <15929.13814.339184.359208@montanaro.dyndns.org> References: <15929.13814.339184.359208@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Dave> In the process of fixing _csv.c so it will handle the parameters Dave> specified in the PEP I came across yet another configurable Dave> dialect setting. Dave> doublequote Dave> When True quotechar in a field value is represented by two Dave> consecutive quotechar. Skip> Isn't that implied as long as quoting is not "never" and Skip> escapechar is None? If so, and we decide to have a separate Skip> doublequote parameter anyway, checking that relationship should Skip> be part of validating the parameter set. Checking against a dialect, or just as a collection of parameters? I think we are fast reaching the point where the only meaningful way forward is to start collecting dialects. Skip> Speaking of doubling things, can the low-level parser support Skip> multi-character quotechar or delimiter strings?
Recall I Skip> mentioned the previous client who didn't quote anything in their Skip> private file format and used ::: as the field separator. Currently the parser only handles single character quotechar, delimiter, and escapechar. I suspect that quotechar, delimiter, and escapechar of more than a single character might be stretching the bounds of what you could reasonably call a CSV parser. - Dave -- http://www.object-craft.com.au From skip at pobox.com Fri Jan 31 01:39:06 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 18:39:06 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <20030130233914.A854F3C32B@coffee.object-craft.com.au> References: <15929.15319.901753.91284@montanaro.dyndns.org> <20030130233914.A854F3C32B@coffee.object-craft.com.au> Message-ID: <15929.50602.984909.597305@montanaro.dyndns.org> Andrew> Maybe just add an "allow_none" flag Good enough for me. Andrew> As mentioned earlier - True and False are also potentially a Andrew> problem. You could add allow_booleans, which would have them written as True and False (those will be grokked by many SQL dialects), otherwise they map to 1 and 0. Skip From LogiplexSoftware at earthlink.net Fri Jan 31 01:48:48 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 16:48:48 -0800 Subject: [Csv] Status In-Reply-To: <20030130234337.230973C32B@coffee.object-craft.com.au> References: <15928.37531.445243.692589@montanaro.dyndns.org> <1043949465.16012.101.camel@software1.logiplex.internal> <20030130234337.230973C32B@coffee.object-craft.com.au> Message-ID: <1043974128.16012.184.camel@software1.logiplex.internal> On Thu, 2003-01-30 at 15:43, Andrew McNamara wrote: > >A comment on the dialect classes: I think a validate() method would be > >good in the base dialect class. A separate validate function would do > >just as well, but it seems logical to make it part of the class.
> > The underlying C module currently validates all the options and will raise > an exception if an unknown option is set, etc. Should we change this - I'd > hate to duplicate the tests? I think having it outside the parser is preferable since it allows for easier customization (especially for the user). I can't think of any useful cases off the top of my head, but my over-engineering instinct tells me this is so. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From andrewm at object-craft.com.au Fri Jan 31 02:27:05 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 12:27:05 +1100 Subject: [Csv] StringIO a bit of a lemon... Message-ID: <20030131012705.9EC753C32B@coffee.object-craft.com.au> Not only does StringIO lack a "mode" attribute, it also can't be used as an iterator (like real file objects), as it lacks a .next() method. This is somewhat annoying: if we accept an iterator, rather than specifically a file, it makes the module more generally useful. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Fri Jan 31 03:00:05 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 13:00:05 +1100 Subject: [Csv] Access Products sample In-Reply-To: <15929.37687.44696.305338@montanaro.dyndns.org> References: <1043957410.16012.122.camel@software1.logiplex.internal> <15929.37687.44696.305338@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: >>> The currency column in the table is actually written out with >>> formatting ($5.66 instead of just 5.66). Note that when Excel >>> exports this column it has a trailing space for some reason >>> (,$5.66 ,). Cliff> So we've actually found an application that puts an extraneous Cliff> space around the data, and it's our primary target. Figures. Skip> So we just discovered we need an "access" dialect. ;-) Not really.
Python has no concept of currency types (last time I looked). The '$5.66 ' thing is an artifact of converting currency to string, not float to string. - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Fri Jan 31 03:06:15 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 13:06:15 +1100 Subject: [Csv] StringIO a bit of a lemon... In-Reply-To: Message from Andrew McNamara <20030131012705.9EC753C32B@coffee.object-craft.com.au> References: <20030131012705.9EC753C32B@coffee.object-craft.com.au> Message-ID: <20030131020615.6CAC03C32B@coffee.object-craft.com.au> >Not only does StringIO lack a "mode" attribute, it also can't be used as >an iterator (like real file objects), as it lacks a .next() method. This >is somewhat annoying: if we accept an iterator, rather than specifically >a file, it makes the module more generally useful. Ignore me. I should be calling iter(fileobj). -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From altis at semi-retired.com Fri Jan 31 05:35:00 2003 From: altis at semi-retired.com (Kevin Altis) Date: Thu, 30 Jan 2003 20:35:00 -0800 Subject: [Csv] Moving _csv.c closer to PEP In-Reply-To: Message-ID: > From: Dave Cole > > Skip> Speaking of doubling things, can the low-level parser support > Skip> multi-character quotechar or delimiter strings? Recall I > Skip> mentioned the previous client who didn't quote anything in their > Skip> private file format and used ::: as the field separator. > > Currently the parser only handles single character quotechar, > delimiter, and escapechar. > > I suspect that quotechar, delimiter, and escapechar of more than a > single character might be stretching the bounds of what you could > reasonably call a CSV parser. Agreed! Double-byte Unicode characters would still be one character in case we do have to do something special for unicode support.
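[Editor's note: Kevin's point - a multi-byte character is still a single character once decoded - is easy to check. The ideographic comma below is an arbitrary example, and the csv module as it eventually shipped does accept any one-character string as a delimiter:]

```python
import csv

delim = '\u3001'  # ideographic comma: one character, three bytes in UTF-8
assert len(delim) == 1
assert len(delim.encode('utf-8')) == 3

rows = list(csv.reader(['a\u3001b\u3001c'], delimiter=delim))
print(rows)  # [['a', 'b', 'c']]
```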
ka From andrewm at object-craft.com.au Fri Jan 31 06:01:14 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 16:01:14 +1100 Subject: [Csv] Added some tests Message-ID: <20030131050114.BCF583C32B@coffee.object-craft.com.au> If you've missed the check-in message, I've added some tests finally (essentially just the tests from the Object Craft CSV module stripped down to just those relevant for the excel dialect). I'm thinking we should organise the tests as: - a bunch of tests for each dialect - a bunch of tests for each backend parser -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Fri Jan 31 06:10:32 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 16:10:32 +1100 Subject: [Csv] Excel - trademark... Message-ID: <20030131051032.45F053C32B@coffee.object-craft.com.au> Are we going to get into any trademark poo by calling the dialect "excel"? Should we call it something else to avoid problems (sigh)? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Fri Jan 31 12:55:47 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 22:55:47 +1100 Subject: [Csv] Excel - trademark... In-Reply-To: <20030131051032.45F053C32B@coffee.object-craft.com.au> References: <20030131051032.45F053C32B@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: Andrew> Are we going to get into any trademark poo by calling the Andrew> dialect "excel"? Should we call it something else to avoid Andrew> problems (sigh)? Dunno. Importers in applications for foreign application data files usually name the foreign application. I just fired up Gnumeric and looked at the import dialog. It says "MS Excel (tm)" Should we call the dialect "excel(tm)"? 
- Dave

-- 
http://www.object-craft.com.au

From skip at pobox.com  Fri Jan 31 13:10:57 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 31 Jan 2003 06:10:57 -0600
Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv csv.py,1.4,1.5
In-Reply-To:
References:
Message-ID: <15930.26577.952898.246807@montanaro.dyndns.org>

    andrew> Modified Files:
    andrew> 	csv.py
    andrew> Log Message:
    andrew> Rename dialects from excel2000 to excel. Rename Error to be
    andrew> CSVError. Explicitly fetch iterator in reader class, rather than
    andrew> simply calling next() (which only works for self-iterators).

Minor nit. I think Error was fine. That's the standard for most
extension modules. I would normally import csv, then reference its
objects through it; csv.CSVError looks redundant to me. I'm not a
"from csv import CSVError" kind of guy, however, so I can understand
the desire to make the name more explicit when considered alone.

Skip

From skip at pobox.com  Fri Jan 31 14:07:01 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 31 Jan 2003 07:07:01 -0600
Subject: [Csv] Excel - trademark...
In-Reply-To: <20030131051032.45F053C32B@coffee.object-craft.com.au>
References: <20030131051032.45F053C32B@coffee.object-craft.com.au>
Message-ID: <15930.29941.147355.904094@montanaro.dyndns.org>

Andrew> Are we going to get into any trademark poo by calling the
Andrew> dialect "excel"? Should we call it something else to avoid
Andrew> problems (sigh)?

I wouldn't worry about it.
Here's a CPAN search for Excel:

    cpan> i /Excel/
    Distribution    I/IS/ISTERIN/XML-Excel-0.02.tar.gz
    Distribution    I/IS/ISTERIN/XML-SAXDriver-Excel-0.06.tar.gz
    Distribution    J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz
    Distribution    K/KW/KWITKNR/DBD-Excel-0.06.tar.gz
    Distribution    K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz
    Distribution    R/RK/RKITOVER/Spreadsheet-ParseExcel_XLHTML-0.02.tar.gz
    Distribution    T/TM/TMTM/Spreadsheet-ParseExcel-Simple-1.01.tar.gz
    Distribution    T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz
    Distribution    T/TM/TMTM/Spreadsheet-WriteExcel-Simple-0.03.tar.gz
    Module          DBD::Excel (K/KW/KWITKNR/DBD-Excel-0.06.tar.gz)
    Module          Spreadsheet::Excel (Contact Author Rachel McGregor Rawlings)
    Module          Spreadsheet::ParseExcel (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::Dump (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::FmtDefault (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::FmtJapan (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::FmtJapan2 (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::FmtUnicode (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::SaveParser (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::Simple (T/TM/TMTM/Spreadsheet-ParseExcel-Simple-1.01.tar.gz)
    Module          Spreadsheet::ParseExcel::Utility (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel_XLHTML (R/RK/RKITOVER/Spreadsheet-ParseExcel_XLHTML-0.02.tar.gz)
    Module          Spreadsheet::WriteExcel (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::BIFFwriter (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::Big (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::Format (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::Formula (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::FromDB (T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz)
    Module          Spreadsheet::WriteExcel::FromDB::Oracle (T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz)
    Module          Spreadsheet::WriteExcel::FromDB::Pg (T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz)
    Module          Spreadsheet::WriteExcel::FromDB::column_finder (T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz)
    Module          Spreadsheet::WriteExcel::FromDB::mysql (T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz)
    Module          Spreadsheet::WriteExcel::FromDB::sybase (T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz)
    Module          Spreadsheet::WriteExcel::OLEwriter (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::Simple (T/TM/TMTM/Spreadsheet-WriteExcel-Simple-0.03.tar.gz)
    Module          Spreadsheet::WriteExcel::Utility (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::Workbook (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::WorkbookBig (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::Worksheet (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Win32::ShellExt::ExcelToClipboard (J/JB/JBNIVOIT/Win32-ShellExt-0.1.zip)
    Module          XML::Excel (I/IS/ISTERIN/XML-Excel-0.02.tar.gz)
    Module          XML::SAXDriver::Excel (I/IS/ISTERIN/XML-SAXDriver-Excel-0.06.tar.gz)

    41 items found

In short, Microsoft will have a field day with the Perl folks long
before they notice us.
Skip

From LogiplexSoftware at earthlink.net  Fri Jan 31 19:17:21 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 31 Jan 2003 10:17:21 -0800
Subject: [Csv] Access Products sample
In-Reply-To:
References: <1043957410.16012.122.camel@software1.logiplex.internal>
	<15929.37687.44696.305338@montanaro.dyndns.org>
Message-ID: <1044037040.15753.190.camel@software1.logiplex.internal>

On Thu, 2003-01-30 at 18:00, Dave Cole wrote:
> >>>>> "Skip" == Skip Montanaro writes:
>
> >>> The currency column in the table is actually written out with
> >>> formatting ($5.66 instead of just 5.66). Note that when Excel
> >>> exports this column it has a trailing space for some reason
> >>> (,$5.66 ,).
>
> Cliff> So we've actually found an application that puts an extraneous
> Cliff> space around the data, and it's our primary target. Figures.
>
> Skip> So we just discovered we need an "access" dialect. ;-)
>
> Not really. Python has no concept of currency types (last time I
> looked). The '$5.66 ' thing is an artifact of converting currency to
> string, not float to string.

I'm not sure what you mean. A trailing space is a trailing space,
regardless of data type. In this case it isn't too important, as the
data isn't quoted (we can just consider the space part of the data),
but it shows that extraneous spaces might not be outside the scope of
our problem.

-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308
(800) 735-0555 x308

From skip at pobox.com  Fri Jan 31 22:39:12 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 31 Jan 2003 15:39:12 -0600
Subject: [Csv] csv.QUOTE_NEVER?
Message-ID: <15930.60672.18719.407166@montanaro.dyndns.org>

The three quoting constants are currently defined as QUOTE_MINIMAL,
QUOTE_ALL and QUOTE_NONNUMERIC. Didn't we decide there would be a
QUOTE_NEVER constant as well?
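[Archive note: the QUOTE_NEVER idea discussed here eventually shipped in the standard library under the name csv.QUOTE_NONE. A sketch of how two of the quoting modes behave in the released csv module (behaviour per the module as it shipped, not the 2003 sandbox):]

```python
import csv
import io

row = ["Test 1", 'Fred said "hey!"', 5.66]

# QUOTE_NONNUMERIC quotes every non-numeric field and doubles any
# embedded quote characters; the float is written bare.
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_NONNUMERIC).writerow(row)
print(out.getvalue())  # "Test 1","Fred said ""hey!""",5.66

# QUOTE_NONE (the eventual name for QUOTE_NEVER) never quotes;
# an escapechar is then required for quotes or delimiters in fields.
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_NONE, escapechar="\\").writerow(row)
print(out.getvalue())
```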
Skip

From skip at pobox.com  Fri Jan 31 22:59:40 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 31 Jan 2003 15:59:40 -0600
Subject: [Csv] PEP 305 - CSV File API
Message-ID: <15930.61900.995242.11815@montanaro.dyndns.org>

A new PEP (305), "CSV File API", is available for reader feedback. This
PEP describes an API and implementation for reading and writing CSV
files. There is a sample implementation available as well, which you
can take out for a spin.

The PEP is available at

    http://www.python.org/peps/pep-0305.html

(The latest version as of this note is 1.9. Please wait until that is
available to grab a copy on which to comment.)

The sample implementation, which is heavily based on Object Craft's
existing csv module, is available at

    http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/

To those people who are already using the Object Craft module: make
sure you rename your csv.so file before trying this one out.

Please send feedback to csv at mail.mojam.com. You can subscribe to
that list at

    http://manatee.mojam.com/mailman/listinfo/csv

That page contains a pointer to the list archives. (Many thanks, BTW,
to Barry Warsaw and the Mailman crew for Mailman 2.1. It looks
awesome.)

-- 
Skip Montanaro
skip at pobox.com
http://www.musi-cal.com/
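[Archive note: the API announced above is essentially what shipped as Python's standard csv module. A minimal usage sketch in the modern spelling, using the skipinitialspace option to handle the post-delimiter whitespace issue raised in the Access/DSV discussion earlier in this thread:]

```python
import csv
import io

data = "Test 1, Fred said hi, 5\nTest 2, plain, 6\n"

# skipinitialspace=True makes the reader ignore whitespace immediately
# following each delimiter, similar to DSV's more forgiving behaviour.
rows = list(csv.reader(io.StringIO(data), skipinitialspace=True))
print(rows)  # [['Test 1', 'Fred said hi', '5'], ['Test 2', 'plain', '6']]

# Without it, the leading spaces become part of each field.
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['Test 1', ' Fred said hi', ' 5'], ...]
```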