CSV revisited

Sun Feb 10 16:59:11 EST 2002

Dave Cole <djc at object-craft.com.au> wrote in message news:<m3ofiyqgjf.fsf at ferret.object-craft.com.au>...
> >>>>> "Raymond" == Raymond Hettinger <othello at javanet.com> writes:
> 
> Raymond> Documentation of key design decisions.  For instance, is the
> Raymond> library limited to reading various CSV formats or is
> Raymond> write-back in the same style to be supported.

YES. Write-back has to be supported, so that data can be fed to
applications that don't accept the MS one true way.

> Raymond> Can the first record contain optional field names?

YES, but the implementation has to be layered; at the bottom layer the
deal should be to get rows of data into and out of Python as cleanly,
correctly and quickly as possible. The next layer up could be told
that the first row contains field names and (e.g.) stuff them into a
dict so that the caller could refer to columns by name rather than by
integer index. The next layer up could have a GUI. Let's design a
reasonable separation of layers. Let's get the bottom layer going
first.
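
A minimal sketch of that separation, assuming the standard-library
csv.reader as a stand-in for whatever fast parser ends up at the
bottom. The names rows() and named_rows() are hypothetical, chosen
only for illustration.

```python
import csv


def rows(lines):
    """Bottom layer: yield each record as a plain list of strings."""
    # csv.reader is only a placeholder for the real bottom-layer parser.
    return csv.reader(lines)


def named_rows(lines):
    """Next layer up: treat the first row as field names and yield dicts."""
    it = iter(rows(lines))
    try:
        names = next(it)
    except StopIteration:
        return  # empty input: no header, no records
    for row in it:
        # The caller can now refer to columns by name instead of by index.
        yield dict(zip(names, row))


data = [
    'name,address_line_1\n',
    'Fred,"3rd Floor, ""Murgatroyd Mansions"""\n',
]
for record in named_rows(data):
    print(record["address_line_1"])
```

The upper layer knows nothing about quoting or delimiters; it only
consumes lists of strings, so the bottom layer can be swapped out
without touching it.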

> Raymond> Will incomplete records raise an exception?

Variations of format should IMO be handled by requiring the caller to
specify the expected format. The module should by default raise an
exception if the data deviates from that format. There may be options
to suppress the exceptions.
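
One way this could look, as a sketch only: the caller declares the
expected number of fields, and a deviating row raises by default. The
names FormatError, check_rows and strict are hypothetical, and padding
short rows in the lenient mode is just one possible choice.

```python
class FormatError(ValueError):
    """Hypothetical exception: a row deviates from the declared format."""


def check_rows(rows, n_fields, strict=True):
    """Yield rows, checking each against the expected field count."""
    for lineno, row in enumerate(rows, start=1):
        row = list(row)
        if len(row) != n_fields:
            if strict:
                raise FormatError(
                    "row %d: expected %d fields, got %d"
                    % (lineno, n_fields, len(row))
                )
            # Lenient mode (an assumption): pad incomplete rows with
            # empty strings; over-long rows are passed through as-is.
            row = row + [""] * (n_fields - len(row))
        yield row


sample = [["a", "b", "c"], ["1", "2"]]
print(list(check_rows(sample, 3, strict=False)))
```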

Back to Dave:
> 
> I am the author of the fast CSV module at:
> 
>         http://www.object-craft.com.au/projects/csv/
> 
> One of the problems with CSV is that everyone has their own ideas
> about what should be handled and how it should be handled.  If there
> was some sort of effort to document the correct behaviour of a CSV
> parser then I would be extremely happy to make mine conform to that
> document.

Ya, ya, me too. See http://www.lexicon.net/sjmachin/delimited.htm

The basic problem is that many people have "interesting" ideas on how
to create CSV-style data files. Query tools for commercial databases
have been known not to properly double the quotes. Given
address_line_1 containing
   3rd Floor, "Murgatroyd Mansions"
they produce
   "3rd Floor, "Murgatroyd Mansions""
instead of 
   "3rd Floor, ""Murgatroyd Mansions"""
This can be recovered if the presumption that there were an even
number of quotes in the original data is correct, but it needs a quite
different FSM to handle it [my quote_mode=1].
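
To make the two cases concrete, here is a rough sketch. quote_field()
shows the correct doubling on output. split_undoubled() is one
possible recovery heuristic for the broken form, not the actual
quote_mode=1 code: it assumes a quote closes the field only when it is
followed by the delimiter or the end of the line AND an even number of
embedded quotes has been seen so far. Both function names are made up
for this example.

```python
def quote_field(text, quote='"'):
    """Write side: wrap a field in quotes, doubling any embedded quote."""
    return quote + text.replace(quote, quote * 2) + quote


def split_undoubled(line, delim=",", quote='"'):
    """Read side for writers that did NOT double embedded quotes."""
    fields = []
    i, n = 0, len(line)
    while True:
        if i < n and line[i] == quote:
            i += 1  # skip the opening quote
            buf = []
            embedded = 0  # embedded quotes seen so far in this field
            while i < n:
                ch = line[i]
                at_end = i + 1 == n or line[i + 1] == delim
                if ch == quote and at_end and embedded % 2 == 0:
                    break  # this quote closes the field
                if ch == quote:
                    embedded += 1
                buf.append(ch)
                i += 1
            else:
                raise ValueError("unterminated quoted field")
            i += 1  # skip the closing quote
            fields.append("".join(buf))
        else:
            # Unquoted field: runs up to the next delimiter.
            j = line.find(delim, i)
            if j == -1:
                j = n
            fields.append(line[i:j])
            i = j
        if i >= n:
            return fields
        i += 1  # skip the delimiter


print(quote_field('3rd Floor, "Murgatroyd Mansions"'))
print(split_undoubled('"3rd Floor, "Murgatroyd Mansions"",x'))
```

This sketch does not collapse properly doubled quotes, so the caller
still has to say which style the file uses.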

Other variations that I have seen are:
(a) Different front & back quotes:
      `3rd Floor, "Murgatroyd Mansions"'
(b) Alternate quote character [my quote_mode=3]: the writer switches
to ' as the quote when the data contains ". So
   3rd Floor, Murgatroyd Mansions
produces
   "3rd Floor, Murgatroyd Mansions"
as expected, but
   3rd Floor, "Murgatroyd Mansions"
produces
   '3rd Floor, "Murgatroyd Mansions"'
This approach is no good if the data can contain both " and '.
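
Variation (b) can be sketched as below, with hypothetical function
names. This illustrates the behaviour described above, including the
failure case; it is not the code of any particular tool.

```python
def quote_alternate(text, primary='"', alternate="'"):
    """Quote with the primary character unless the data contains it."""
    if primary not in text:
        return primary + text + primary
    if alternate not in text:
        return alternate + text + alternate
    # The scheme has no answer when both characters occur in the data.
    raise ValueError("field contains both quote characters")


def unquote_alternate(field, primary='"', alternate="'"):
    """Strip one matching pair of outer quotes, whichever kind was used."""
    if len(field) >= 2 and field[0] in (primary, alternate) and field[-1] == field[0]:
        return field[1:-1]
    return field


print(quote_alternate('3rd Floor, Murgatroyd Mansions'))
print(quote_alternate('3rd Floor, "Murgatroyd Mansions"'))
```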

:-)
A few other desiderata: shouldn't leak memory, shouldn't die with a
SEGV (or SRE recursion limit exceeded!!!) if you ask it to unpack say
"x" * 70000 ...
(-:
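
For what it's worth, a plain character-at-a-time state machine avoids
the recursion problem entirely, since it uses neither regular
expressions nor recursive calls. The sketch below handles only the
standard doubled-quote style; the state names and split_record() are
hypothetical.

```python
START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED = range(4)


def split_record(line, delim=",", quote='"'):
    """Split one record into fields with an iterative FSM."""
    fields, buf, state = [], [], START
    for ch in line:
        if state == QUOTED:
            if ch == quote:
                state = QUOTE_IN_QUOTED  # closing quote, or first of a pair
            else:
                buf.append(ch)  # delimiters are literal inside quotes
        elif state == QUOTE_IN_QUOTED and ch == quote:
            buf.append(quote)  # doubled quote -> one literal quote
            state = QUOTED
        elif ch == delim:
            fields.append("".join(buf))
            buf = []
            state = START
        elif state == START and ch == quote:
            state = QUOTED
        elif state == QUOTE_IN_QUOTED:
            raise ValueError("unexpected character after closing quote")
        else:
            buf.append(ch)
            state = UNQUOTED
    if state == QUOTED:
        raise ValueError("unterminated quoted field")
    fields.append("".join(buf))
    return fields


big = split_record("x" * 70000)
print(len(big), len(big[0]))
```

Memory use is linear in the length of the record, and the work per
character is constant, so a 70000-character field is nothing special.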