[Csv] Thoughts about a patch

Magnus Lie Hetland magnus at hetland.org
Mon Mar 15 09:09:45 CET 2004


Andrew McNamara <andrewm at object-craft.com.au>:
>
> >Don't go all-out on this. Simply interpret '\\\n' as '\n', just like
> >we interpret '\\:' as ':' (if ':' is the field separator). After all,
> >'\n' (or, in general, the record separator) is just as much a special
> >character in need of quoting as the other three (escape, delimiter,
> >and quote character).
> 
> I guess that sounds reasonable. 

OK. Now, this applies to reading, so it would imply making
lineterminator work for readers as well.

> It's often very difficult to make changes to code that is in the
> standard distribution - there always seems to be someone relying on
> the previous behaviour... 8-)

Yes, indeed. I've been thinking about that. Perhaps there should be
some flag or mode or something that decides how things work? For
example, there could be a "compatibility" flag that is True by
default; or there could be an "ESCAPE_ONLY" value for quoting... Or
even separate functions or a separate submodule... I don't know.

It seems that, perhaps, even though this is a relatively minor issue,
it might warrant a PEP...?

> You might want to make sure that, inside quotes, the special meaning
> of the escape character is removed (on the basis that Excel uses
> quotes exclusively (no quote character).

Hm. How about a quoted field like this, then?

  "Foo bar \" baz"

With '"' as quotechar and '\\' as escapechar. Wouldn't it be natural
to allow this, and to interpret '\\"' as '"'? I mean, if you *didn't*
want this behavior, you'd set escapechar to None -- right?

> However - I suspect we didn't get this right, and still honour the
> escape within a quoted string - if you find that we still honour the
> escape within a quoted string, your change should too (to remain
> consistent).

I'm not sure exactly how you mean it should behave. I understand that,
for example

  "foo \, bar"

should become

  ['foo \\, bar']

and not

  ['foo , bar']

But still,

  "foo \" bar"

should become

  ['foo " bar']

in my opinion. Don't you agree?

However, as it is, "foo \, bar" is interpreted as ['foo , bar'].
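
For anyone who wants to try these cases, here is a small sketch using
the csv module as found in a current Python 3 (an assumption; it may
not match the version under discussion in every detail). The helper
name parse is mine, purely for illustration.

```python
import csv

def parse(line, **fmtparams):
    """Parse a single line with csv.reader and return its list of fields."""
    return next(csv.reader([line], **fmtparams))

# Quoted field, '\\' as escapechar: the escape is honoured inside the quotes.
print(parse(r'"foo \, bar"', escapechar="\\"))
print(parse(r'"foo \" bar"', escapechar="\\"))

# No escapechar (the default): the backslash is an ordinary character.
print(parse(r'"foo \, bar"'))
```

With an escapechar set, both the escaped comma and the escaped quote
lose their backslash; with escapechar left as None, the backslash
stays in the field.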

It almost seems like this should be dialect-dependent -- but, then
again, lots of interacting parameters is a recipe for (combinatorial)
disaster. (And the vagueness and complexity of the Microsoft CSV
dialect isn't helping :)

> Did that make any sense?

Sure. The core issue, IMO, is what the escape character really
means, and whether that meaning can be constant or whether it must
depend on something else. 

OTOH: It could be possible to say that the behavior when using quoting
*and* an escape character together is undefined -- that quoting and
escaping are two mutually exclusive ways of dealing with separators
(both field and record (i.e. line) separators) in fields.

Does that seem reasonable? One could even issue a warning if the user
has quotechar and escapechar set at the same time, maybe? Then we'd
get away from the pesky interactions between the two... (Similar
warnings would apply to doublequote, of course.)
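
A rough sketch of what such a check might look like. The function
check_dialect is hypothetical (nothing like it exists in the csv
module), and reading the doublequote case as "doublequote together
with escapechar" is my own assumption.

```python
import warnings

def check_dialect(quotechar='"', escapechar=None, doublequote=True):
    """Hypothetical check: warn when escaping is combined with quoting.

    Returns the list of problems found, so callers can inspect it.
    """
    problems = []
    if escapechar is not None:
        if quotechar is not None:
            problems.append("quotechar and escapechar are both set")
        if doublequote:
            # Assumption: doublequote is treated like quotechar here.
            problems.append("doublequote and escapechar are both set")
    for message in problems:
        warnings.warn(message + "; their interaction is undefined",
                      UserWarning, stacklevel=2)
    return problems
```

Calling check_dialect(quotechar=None, escapechar='\\', doublequote=False)
would stay silent, while the default quotechar plus an escapechar
would trigger a warning.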

And the behavior of the escape character, when quotes are out of the
picture, could be defined as something like: "when preceding either
separator, lineterminator or escapechar, the escapechar is removed and
the separator/lineterminator/escapechar is included verbatim in the
field."

There would still be two remaining issues, however:

 1. How should an escapechar preceding some *other* character be
    interpreted? The most backward-compatible approach would simply be
    to include the escape character verbatim -- but then escaping the
    escape character becomes redundant. It would also make it hard to
    interpret special sequences such as \n or \t for the client code,
    because the backslash in these sequences would end up at the same
    "escape level" as the \\. For example,

      foo \\n bar \n

    would be read in as "foo \n bar \n" -- and the client code
    couldn't tell the two apart. Not good.
 
 2. Is it really okay for an escape character to escape a
    multi-character sequence? If it is to escape the lineterminator,
    it must work for multi-character sequences such as '\r\n'. This
    *might* lead to confusion, as the convention for escape characters
    is to escape only the following character.

A possibility is to let the escape character mean "reproduce the
following character verbatim and remove me, no matter what". Then '\n'
and '\t' would simply mean 'n' and 't' -- possibly surprising -- and
each character in the line terminator would be escaped separately.
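
To make the two policies concrete, here is a minimal sketch of an
unquoted-field splitter. The function split_record and its
keep_unknown flag are hypothetical names of mine, not part of the csv
module, and line terminators are deliberately left out: the input is
assumed to be a single record with its terminator already removed.

```python
def split_record(line, delimiter=",", escapechar="\\", keep_unknown=True):
    """Split one unquoted record into fields (illustrative sketch only).

    keep_unknown=True:  the escapechar is dropped only before the delimiter
                        or another escapechar; before any other character
                        it is kept verbatim (the backward-compatible idea).
    keep_unknown=False: the escapechar is always dropped and the following
                        character is kept, "no matter what".
    """
    special = {delimiter, escapechar}
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == escapechar:
            nxt = next(chars, None)
            if nxt is None:
                current.append(ch)        # trailing escape: keep it as-is
            elif nxt in special or not keep_unknown:
                current.append(nxt)       # drop the escape, keep the char
            else:
                current.extend((ch, nxt)) # keep both characters verbatim
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

source = r"foo \\n bar \n"
print(split_record(source))                      # both sequences look alike
print(split_record(source, keep_unknown=False))  # they stay distinguishable
```

Under the first policy the two backslash sequences collapse into the
same text, which is the ambiguity from point 1; under the second they
come out differently, at the price of '\n' meaning a plain 'n'.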

Oh, well. Maybe I should just go with XML after all. <sigh/wink>

-- 
Magnus Lie Hetland           "The mind is not a vessel to be filled,
http://hetland.org            but a fire to be lighted."  [Plutarch]

