[Csv] Devil in the details, including the small one between delimiters and quotechars

Thu Jan 30 15:03:57 CET 2003

Checking against the current version of the CSV parser.

Cliff> 1, "not quoted","quoted"

Cliff> It seems reasonable to parse this as:

Cliff> [1, ' "not quoted"', "quoted"]

Cliff> which is the described Excel behavior.

>>> import _csv
>>> p = _csv.parser()
>>> p.parse('1, "not quoted","quoted"')
['1', ' "not quoted"', 'quoted']

Looks OK.
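
For comparison, a minimal sketch of the same check using csv.reader from the present-day standard library. That is an assumption: it is not the _csv.parser() interface shown above. With default settings, the space after the delimiter means the second field is treated as unquoted, so its quote characters are kept literally.

```python
import csv

# Assumption: modern standard-library csv.reader, default "excel" dialect,
# not the _csv.parser() object used in the transcript above.
line = '1, "not quoted","quoted"'

# csv.reader accepts any iterable of lines; a one-element list is enough.
row = next(csv.reader([line]))
print(row)  # ['1', ' "not quoted"', 'quoted']
```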

Cliff> Now consider

Cliff> 1,"not quoted" ,"quoted"

Cliff> Is the second field quoted or not?  If it is, do we discard the
Cliff> extraneous whitespace following it or raise an exception?

The current version of the _csv parser does one of two things,
depending on the value of the strict parameter:

>>> p.strict
0
>>> p.parse('1,"not quoted" ,"quoted"')
['1', 'not quoted ', 'quoted']
>>> p.strict = 1
>>> p.parse('1,"not quoted" ,"quoted"')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
_csv.Error: , expected after "
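
The same lenient/strict split can be sketched with csv.reader from the present-day standard library, which takes strict as a formatting parameter. Assumption: this is the modern csv API, not the p.strict attribute on the parser object above. The error message text is not relied on here, only the exception type.

```python
import csv

line = '1,"not quoted" ,"quoted"'

# Lenient (the default, strict=False): the stray space after the closing
# quote is simply appended to the field.
lenient = next(csv.reader([line]))
print(lenient)  # ['1', 'not quoted ', 'quoted']

# strict=True: the same input is rejected with csv.Error.
try:
    next(csv.reader([line], strict=True))
    strict_error = None
except csv.Error as exc:
    strict_error = str(exc)
print(strict_error)
```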

Skip> Well, there's always the, "be flexible in what you accept,
Skip> strict in what you generate" school of thought.  In the above,
Skip> that would suggest the list returned would be

Skip>     ['1', 'not quoted', 'quoted']

Why wouldn't you include the trailing space on the second field?

Andrew, what does Excel do here?

Hmm...  I was sort of expecting _csv to do this:

['1', 'not quoted" ', 'quoted']

Skip> It seems like a minor formatting glitch.  How about a warning?
Skip> Or a "strict" flag for the parser?

I think there are enough variations here that a single strict flag is
not enough.  Of the candidates below, the second one does look a bit
bogus...

['1', '"not quoted" ', 'quoted']
['1', 'not quoted" ', 'quoted']
['1', 'not quoted ', 'quoted']

Cliff> Worse, consider this
Cliff> "quoted", "not quoted, but this ""field"" has delimiters and quotes"

Skip> Depends on the setting of skipinitialspaces.  If false, you get
Skip>   ['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"']

The parser does this:

['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"']

Skip> if True, I think you get

Skip>   ['quoted', 'not quoted, but this "field" has delimiters and quotes']

Yeah, but the doublequote stuff is only meant for quoted fields (or is
it?).
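
A minimal sketch of both settings on Cliff's line, using csv.reader from the present-day standard library. Assumptions: the modern parameter is spelled skipinitialspace, and this is not the _csv.parser() interface under discussion. When the leading space is skipped, the field is seen as quoted, so the doubled quotes collapse. When it is not skipped, the field is unquoted and everything is kept literally.

```python
import csv

line = '"quoted", "not quoted, but this ""field"" has delimiters and quotes"'

# skipinitialspace=False (the default): the space makes the second field
# unquoted, so the embedded comma splits it and the quotes stay literal.
kept = next(csv.reader([line], skipinitialspace=False))
print(kept)

# skipinitialspace=True: the space is dropped, the field counts as quoted,
# and "" inside it is read as a single quote character.
skipped = next(csv.reader([line], skipinitialspace=True))
print(skipped)
```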

Cliff> How should this parse?  I say free exceptions for everyone.

Don't know if exceptions are what we need.  We just need to come up
with parameters which control the parser to sufficient detail to
handle the dialect variations.
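
One way to picture "parameters which control the parser" is a dialect class that bundles the settings. This is only a sketch against the present-day standard-library csv module, which is an assumption here. The class name SpaceTolerant is hypothetical and made up for illustration.

```python
import csv

# Hypothetical dialect: Excel defaults, but tolerate a space between
# the delimiter and an opening quote.
class SpaceTolerant(csv.excel):
    skipinitialspace = True

line = '1, "a,b","c"'

# Plain Excel dialect: the second field is unquoted, so the comma splits it.
plain = next(csv.reader([line], dialect=csv.excel))
print(plain)  # ['1', ' "a', 'b"', 'c']

# Space-tolerant dialect: the space is skipped and the field is quoted.
tolerant = next(csv.reader([line], dialect=SpaceTolerant))
print(tolerant)  # ['1', 'a,b', 'c']
```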

Cliff> I propose space between delimiters and quotes raise an exception
Cliff> and let's be done with it.  I don't think this really affects
Cliff> Excel compatibility since Excel will never generate this type of
Cliff> file and doesn't require it for import.  It's true that some
Cliff> files that Excel would import (probably incorrectly) won't import
Cliff> in CSV, but I think that's outside the scope of Excel
Cliff> compatibility.

Skip> Sounds good to me.

I dunno.  We should look at the corner cases and handle as many as we
can in the dialect.  That is sort of the whole point of why we are
here.

Cliff> Anyway, I know no one has said "On your mark, get set" yet, but I
Cliff> can't think without code sitting in front of me, breaking worse
Cliff> with every keystroke, so in addition to creating some test cases,
Cliff> I've hacked up a very preliminary CSV module so we have something
Cliff> to play with.  I was up til 6am so if there's anything odd, I
Cliff> blame it on lack of sleep and the feverish optimism and glossing
Cliff> of detail that comes with it.

Skip> Perhaps you and Dave were in a race but didn't know it? ;-)

When Skip mentioned that we were going to have the speedy Object Craft
parser, I just checked in the _csv module.  It does not handle
everything we have been discussing, but it is close.

- Dave

-- 
http://www.object-craft.com.au