reading back what you wrote (was Re: Andrew Dalke's space example (was Re: [Csv] csv))

Mon Feb 17 02:13:17 CET 2003

On 17 Feb 2003 10:30:47 +1100, Dave Cole <djc at object-craft.com.au> wrote:

>>>>>> "John" == John Machin <sjmachin at lexicon.net> writes:
>
> John> [Dave Cole]
>>> Aside from the quote of '\0', I am not sure I follow what you mean.
>>> If you set quoting so that it produces ambiguous output that is
>>> hardly the fault of the writer.
>
> John> Of course not. What I was getting at was that the ability to
> John> write various schemes (some ambiguous, some not) is provided,
> John> but it is not possible to read back all unambiguous schemes, and
> John> there is little if any support for checking that the data
> John> corresponds to the scheme the caller thinks was used to write
> John> it, and there are no options to drive what to do on input if the
> John> writing scheme was ambiguous.
>
> I must be a bit thick or something...  I have the feeling you are
> correct, but I just can't see it.  Can you provide some (simple)
> examples and suggest where the code could be improved?
>

Here is my approach:

(1) Define not only a scheme for writing "standard" CSV but also schemes
for writing the various mutations that I have come across.

(2) Have a strict_output option to govern behaviour when the data is such
that the output cannot be read back unambiguously (raise an exception
immediately, raise an exception at the end if the error count is not
zero, or raise no exception).

Examples: (a) someone wants to write using a no-quoting scheme but they
have a delimiter inside a field; (b) someone uses a doublequote=False,
escapechar=None scheme but there is a quotechar in the data.

(3) On input, require the caller to specify exactly what scheme they think 
was used to create the data. Check carefully that the incoming data 
corresponds to the alleged scheme. Again, have a strict_input option.
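
A minimal sketch of the output check described in point (2), assuming
only the two failure cases (a) and (b) above. The names
unreversible_fields, check_rows and IrreversibleOutput are hypothetical
and belong neither to the csv module nor to the delimited module; the
three strict_output values mirror the three behaviours listed in (2).

```python
class IrreversibleOutput(ValueError):
    """Raised when written data could not be read back unambiguously."""


def unreversible_fields(row, delimiter=",", quotechar='"', quoted=True,
                        doublequote=True, escapechar=None):
    """Return the indexes of fields that the scheme cannot round-trip."""
    if escapechar is not None:
        # Assumption: an escape character can protect any special character.
        return []
    bad = []
    for index, field in enumerate(row):
        if not quoted and (delimiter in field or "\n" in field):
            bad.append(index)   # case (a): delimiter in an unquoted field
        elif quoted and not doublequote and quotechar in field:
            bad.append(index)   # case (b): undoubled quote inside quotes
    return bad


def check_rows(rows, strict_output="immediate", **scheme):
    """Count bad rows; strict_output is 'immediate', 'at_end' or 'off'."""
    errors = 0
    for number, row in enumerate(rows):
        if unreversible_fields(row, **scheme):
            if strict_output == "immediate":
                raise IrreversibleOutput("row %d cannot be read back" % number)
            errors += 1
    if errors and strict_output == "at_end":
        raise IrreversibleOutput("%d row(s) cannot be read back" % errors)
    return errors
```

With strict_output="off" the caller gets only the error count back and
can decide what to do with it.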

Here we have some data that was written by a doublequote=False, 
escapechar=None, quoting=QUOTE_ALL scheme:

>>> badcsv = ['"quotes not doubled"', '"rear of "Fubar Flats""',
...           '""Thistle Do" RMB 123"']

and it is munged w/o warning if read with standard CSV settings:

>>> [x for x in csv.reader(badcsv)]
[['quotes not doubled'], ['rear of Fubar Flats""'], ['Thistle Do" RMB 123"']]

and trying to tell the csv module what to do doesn't help:

>>> [x for x in csv.reader(badcsv, doublequote=False, escapechar=None)]
[['quotes not doubled'], ['rear of Fubar Flats""'], ['Thistle Do" RMB 123"']]
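
For comparison, later releases of the standard csv module accept a
strict dialect flag that turns this kind of silent munging into an
exception. The sketch below assumes such a release; the helper name
read_strict is made up for illustration.

```python
import csv

badcsv = ['"quotes not doubled"', '"rear of "Fubar Flats""',
          '""Thistle Do" RMB 123"']


def read_strict(lines):
    """Read lines as standard CSV, raising csv.Error on malformed quoting."""
    return list(csv.reader(lines, strict=True))


try:
    rows = read_strict(badcsv)
except csv.Error as exc:
    # The second line has a quote followed by something other than a
    # quote, a delimiter or end of line, so strict mode rejects it.
    rows = None
    print("rejected:", exc)
```

This only detects the mismatch; it does not recover the original data.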

It is possible to recover the data if each field has an even number of
quotes, but this requires quite a different state machine:

>>> badcsvstr = ('"quotes not doubled"\n"rear of "Fubar Flats""\n'
...              '""Thistle Do" RMB 123"')
# My module requires input iterables only to deliver one or more bytes
# per iteration, i.e. each item can be more or less than exactly one
# line, and the module does the end-of-line detection. Yes, it
# special-cases the iterable being a string, for obvious efficiency
# reasons.

>>> [x for x in delimited.importer(badcsvstr,
...                                quote_mode=delimited.QUOTE_SINGLE)]
[['quotes not doubled'], ['rear of "Fubar Flats"'], ['"Thistle Do" RMB 123']]
# We've recovered what was most likely to have been in the original data
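
One way such a reader could work, sketched under a stated assumption:
inside a quoted field, a quotechar only closes the field when it is
immediately followed by a delimiter or the end of the line. This is
not the delimited module's code, and split_undoubled is a hypothetical
name. The heuristic still misreads a field whose data contains a
quotechar directly followed by a delimiter.

```python
def split_undoubled(line, delimiter=",", quotechar='"'):
    """Split one line whose embedded quotes were written undoubled."""
    fields = []
    i, n = 0, len(line)
    while True:
        if i < n and line[i] == quotechar:
            # Quoted field: look for a quote followed by delimiter or end.
            j = i + 1
            while True:
                k = line.find(quotechar, j)
                if k == -1:
                    raise ValueError("unterminated quote at column %d" % i)
                if k + 1 == n or line[k + 1] == delimiter:
                    break
                j = k + 1          # embedded quote: keep it as data
            fields.append(line[i + 1:k])
            i = k + 1
        else:
            # Unquoted field: runs up to the next delimiter or end of line.
            k = line.find(delimiter, i)
            if k == -1:
                k = n
            fields.append(line[i:k])
            i = k
        if i >= n:
            return fields
        i += 1                     # step over the delimiter


badcsv = ['"quotes not doubled"', '"rear of "Fubar Flats""',
          '""Thistle Do" RMB 123"']
recovered = [split_undoubled(line) for line in badcsv]
```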

and my module will crack it (raise an exception) if told that this data
is standard CSV:

>>> impo = delimited.importer(badcsvstr)
>>> list(impo)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
delimited.DataError: After rear_quote, expected rear_quote, delimiter or newline; found <F> (hex 46)

and just in case you're trying to find the offending line in a 100 MB file:

>>> impo.input_row_number, impo.input_char_column
(1, 10) # zero-relative
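
A rough sketch of how a strict reader can keep track of that position,
assuming standard doubled-quote CSV and "\n" line ends. The function
first_violation and its state names are hypothetical, not the
delimited module's API. It counts physical lines and columns, both
zero-relative, and stops at the first character that may not follow a
closing quote.

```python
def first_violation(text, delimiter=",", quotechar='"'):
    """Return (row, column, char) of the first bad character, or None."""
    row = col = 0
    state = "start"                      # at the start of a field
    for ch in text:
        if state == "start":
            if ch == quotechar:
                state = "quoted"
            elif ch != delimiter and ch != "\n":
                state = "bare"
        elif state == "bare":            # unquoted field: anything goes
            if ch == delimiter or ch == "\n":
                state = "start"
        elif state == "quoted":
            if ch == quotechar:
                state = "after_quote"
        else:                            # after_quote
            if ch == quotechar:
                state = "quoted"         # doubled quote, stay in the field
            elif ch == delimiter or ch == "\n":
                state = "start"
            else:
                return (row, col, ch)
        if ch == "\n":
            row, col = row + 1, 0
        else:
            col += 1
    return None


badcsvstr = ('"quotes not doubled"\n"rear of "Fubar Flats""\n'
             '""Thistle Do" RMB 123"')
position = first_violation(badcsvstr)
```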

Hope this explains where I'm coming from ...

Cheers,
John