From skip at pobox.com Tue Mar 2 16:23:38 2004
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 2 Mar 2004 09:23:38 -0600
Subject: [Csv] Re: csv bugs
In-Reply-To:
References:
Message-ID: <16452.42746.753582.600239@montanaro.dyndns.org>

(A better place for this discussion would probably be
csv at mail.mojam.com. I'm adding it to the cc list.)

    Magnus> It seems that when a line termination is escaped (using the
    Magnus> current escape character), csv.reader treats it as a line
    Magnus> continuation, which is well and good -- but it doesn't discard
    Magnus> the escape character; instead, it escapes it implicitly. This
    Magnus> seems like a bug to me. E.g.

    Magnus>   foo:bar:baz\
    Magnus>   frozz:bozz

    Magnus> with separator ':' and escape character '\\' is parsed into

    Magnus>   ['foo', 'bar', 'baz\\\nfrozz', 'bozz']

    Magnus> In my opinion, it *ought* to be parsed into

    Magnus>   ['foo', 'bar', 'baz\nfrozz', 'bozz']

    Magnus> As far as I know, this is the UNIX convention, as used in (e.g.)
    Magnus> /etc/passwd.

That may be, however development of the csv module's parser was driven by
how Microsoft Excel behaves. The assumption was (rightly I think) that
Excel reads or writes more CSV files than anything else. I don't believe
it does anything with backslashes.

    Magnus> Am I off target here? If the current behaviour is desirable
    Magnus> (although I can't see why it should be) then at least I think
    Magnus> there should be a way of implementing "normal" line
    Magnus> continuations (as in my example), which is the standard UNIX
    Magnus> behavior, and the behavior of Python source, for that
    Magnus> matter. Otherwise, csv can't be used to parse (e.g.)
    Magnus> /etc/passwd...

You're welcome to submit a patch. I don't have time for it.

    Magnus> And another thing: Perhaps a 'passwd' dialect could be added
    Magnus> alongside 'excel'?
    Magnus> Something like:

    Magnus>   class passwd(Dialect):
    Magnus>       delimiter = ':'
    Magnus>       doublequote = False
    Magnus>       escapechar = '\\'
    Magnus>       lineterminator = '\n'
    Magnus>       quotechar = '?'
    Magnus>       quoting = QUOTE_NONE
    Magnus>       skipinitialspace = False

    Magnus>   register_dialect("passwd", passwd)

I'll take a look at that.

    Magnus> For some reason you *have* to supply a quotechar, even if you
    Magnus> set QUOTE_NONE... I guess that's a bug too, in my book.

Maybe. Maybe just a feature.

    Magnus> If there are no objections, I might submit some of this as a bug
    Magnus> report or two (or even a patch).

Please do.

Skip

From magnus at hetland.org Tue Mar 2 18:24:46 2004
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Tue, 2 Mar 2004 18:24:46 +0100
Subject: [Csv] Re: csv bugs
Message-ID: <20040302172446.GA17004@idi.ntnu.no>

> (A better place for this discussion would probably be
> csv at mail.mojam.com. I'm adding it to the cc list.)

Ah -- sorry. I wasn't aware of the list. I've subscribed now.

[snip]
> That may be, however development of the csv module's parser was
> driven by how Microsoft Excel behaves.

But wasn't allowing "full" customization also a driving force?

> The assumption was (rightly I think) that Excel reads or writes more
> CSV files than anything else. I don't believe it does anything with
> backslashes.

I'm sure you're right. The point is that the csv module supports
escape characters, and I believe the thing I pointed out is a missing
piece of functionality for those.

In other words: The Excel dialect uses quoting to deal with in-field
separators, quotes and newlines. The passwd dialect uses escapes to
deal with these. *However*, the csv module only supports dealing with
separators and escape characters using the escape character (quotes
are a non-issue, of course), not newlines. In other words, if you
choose to use an escape character rather than quotes, you can't have
newlines in your fields. Almost, anyway.
The fact is, as far as I can see, that you *can* escape newlines, but
in that special case, the escape character *isn't* removed (as it is
when you escape separators or escape characters). This seems
inconsistent, and has nothing to do with backslashes in particular,
just how escape characters should behave.

[snip]
> You're welcome to submit a patch. I don't have time for it.

OK -- I guess I'm mainly looking for some feedback about whether this
seems like a reasonable behavior. (I'm quite thoroughly convinced that
it is, but I may very well be wrong :)

I haven't looked at the C implementation, so no promises about a patch
there... :/

> > And another thing: Perhaps a 'passwd' dialect could be added
> > alongside 'excel'? Something like:
[snip]
> I'll take a look at that.

Not sure about setting the quote character to '?' here, but since it
doesn't matter and you need to have one, it seemed like a natural
choice. (None wasn't allowed.)

> > For some reason you *have* to supply a quotechar, even if you
> > set QUOTE_NONE... I guess that's a bug too, in my book.
>
> Maybe. Maybe just a feature.

Well, maybe ;)

But if you don't need an escape character when you're using quotes, I
don't think you should need quotes when you're using an escape
character.

Then again: I guess you do use an escape character (i.e. a double
quote) in the quoted mode as well, which may be what's complicating
the semantics and confusing me. Not sure how "foo " bar" should be
interpreted, for example. In this case removing the quote may not make
sense. And... Adding another switch (or something) dictating the
behavior of the escape character doesn't seem good...

> Skip

--
Magnus Lie Hetland "The mind is not a vessel to be filled,
http://hetland.org but a fire to be lighted."
[Plutarch]

From mamo19 at handelsbanken.se Fri Mar 5 17:48:54 2004
From: mamo19 at handelsbanken.se (Marjaneh Mojaverian)
Date: Fri, 5 Mar 2004 17:48:54 +0100
Subject: [Csv] PEP 305
Message-ID:

I would like to know which extension library to use if I need to use
the CSV module in an external application.

Regards
Marjaneh

From magnus at hetland.org Sat Mar 13 16:48:08 2004
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Sat, 13 Mar 2004 16:48:08 +0100
Subject: [Csv] Thoughts about a patch
Message-ID: <20040313154808.GA8421@idi.ntnu.no>

(First some background -- see bottom for my suggestion.)

I mentioned offering a patch to support Unix-style password syntax.
I've had a look at the _csv.c code, and it seems quite possible to do.

I've also had the following pointed out to me:

  http://groups.google.com/groups?selm=vsb89q1d3n5qb1%40corp.supernews.com

I guess I could take a look at some of the issues there too, while I'm
at it (since they seem related)?

The syntax I'm thinking about is described in ESR's newest book (The
Art of Unix Programming):

  http://www.catb.org/~esr/writings/taoup/html/ch05s02.html#id2901882

Basically there is no quoting, only escaping. Also, escaped newlines
are ignored (or so he says -- I expect the convention here would be to
treat it as a single space character) and it is possible to include
C-style backslash escapes (just like in Python strings; \n is a
newline and so forth).

The current behavior of _csv.c is simply to put in the escape
character verbatim, unless it precedes another escape character, a
delimiter, or a quote character. So I guess it's only necessary to
expand slightly the 'case ESCAPED_CHAR:' bit.

Suggestion (ignoring the bugs reported in the Usenet post above, for
now):

Don't go all-out on this.
Simply interpret '\\\n' as '\n', just like we interpret '\\:' as ':'
(if ':' is the field separator). After all, '\n' (or, in general, the
record separator) is just as much a special character in need of
quoting as the other three (escape, delimiter, and quote character).
C-style escapes, however, aren't as integral to the CSV language --
they can be handled afterward, when interpreting the contents.

Does this seem OK? It would mean slight backward-breakage, but it
seems odd that someone should have escaped newlines and still wanted
the escape character to be left in place, doesn't it?

If this seems OK I'll be happy to write up a patch.

--
Magnus Lie Hetland "The mind is not a vessel to be filled,
http://hetland.org but a fire to be lighted." [Plutarch]

From magnus at hetland.org Sat Mar 13 16:54:04 2004
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Sat, 13 Mar 2004 16:54:04 +0100
Subject: [Csv] Thoughts about a patch
Message-ID: <20040313155404.GA9516@idi.ntnu.no>

I guess I just haven't understood the code well enough yet, but in the
parsing code there are comparisons of the type

  if (c == '\n')

I suppose the newlines are normalized versions of lineterminator? In
other words, no matter what the line terminator is, it is safe to
pretend that it has been changed to '\n' in the parsing case
statement? Or? (I mean, I've tried to use lineterminator='|' and that
worked just nicely, but I don't see the use of lineterminator in the
case statement anywhere.)

--
Magnus Lie Hetland "The mind is not a vessel to be filled,
http://hetland.org but a fire to be lighted." [Plutarch]

From andrewm at object-craft.com.au Mon Mar 15 00:21:58 2004
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 15 Mar 2004 10:21:58 +1100
Subject: [Csv] Thoughts about a patch
In-Reply-To: Message from Magnus Lie Hetland of "Sat, 13 Mar 2004 16:48:08 BST."
    <20040313154808.GA8421@idi.ntnu.no>
References: <20040313154808.GA8421@idi.ntnu.no>
Message-ID: <20040314232158.7C7543C0BA@coffee.object-craft.com.au>

>Don't go all-out on this. Simply interpret '\\\n' as '\n', just like
>we interpret '\\:' as ':' (if ':' is the field separator). After all,
>'\n' (or, in general, the record separator) is just as much a special
>character in need of quoting as the other three (escape, delimiter,
>and quote character).

I guess that sounds reasonable. It's often very difficult to make
changes to code that is in the standard distribution - there always
seems to be someone relying on the previous behaviour... 8-)

You might want to make sure that, inside quotes, the special meaning
of the escape character is removed (on the basis that Excel uses
quotes exclusively (no quote character). However - I suspect we didn't
get this right, and still honour the escape within a quoted string -
if you find that we still honour the escape within a quoted string,
your change should too (to remain consistent).

Did that make any sense?

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Mon Mar 15 00:26:51 2004
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 15 Mar 2004 10:26:51 +1100
Subject: [Csv] Thoughts about a patch
In-Reply-To: Message from Magnus Lie Hetland of "Sat, 13 Mar 2004 16:54:04 BST."
    <20040313155404.GA9516@idi.ntnu.no>
References: <20040313155404.GA9516@idi.ntnu.no>
Message-ID: <20040314232651.E48C43C0BA@coffee.object-craft.com.au>

>I guess I just haven't understood the code well enough yet, but in the
>parsing code there are comparisons of the type
>
>  if (c == '\n')
>
>I suppose the newlines are normalized versions of lineterminator? In
>other words, no matter what the line terminator is, it is safe to
>pretend that it has been changed to '\n' in the parsing case
>statement? Or?
>(I mean, I've tried to use lineterminator='|' and that
>worked just nicely, but I don't see the use of lineterminator in the
>case statement anywhere.)

One thing to bear in mind is the history of the CSV module - it dates
back to Python 1.5 times, when python didn't have universal newline
support.

If I remember correctly, lineterminator is only used when generating
CSV output, not when parsing input. On input, the value of
lineterminator is ignored, and \r and \n are hard-coded.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From magnus at hetland.org Mon Mar 15 08:44:33 2004
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Mon, 15 Mar 2004 08:44:33 +0100
Subject: [Csv] Thoughts about a patch
In-Reply-To: <20040314232651.E48C43C0BA@coffee.object-craft.com.au>
References: <20040313155404.GA9516@idi.ntnu.no>
    <20040314232651.E48C43C0BA@coffee.object-craft.com.au>
Message-ID: <20040315074433.GA16067@idi.ntnu.no>

Andrew McNamara :
>
> >I guess I just haven't understood the code well enough yet, but in the
> >parsing code there are comparisons of the type
> >
> >  if (c == '\n')
> >
> >I suppose the newlines are normalized versions of lineterminator? In
> >other words, no matter what the line terminator is, it is safe to
> >pretend that it has been changed to '\n' in the parsing case
> >statement? Or? (I mean, I've tried to use lineterminator='|' and that
> >worked just nicely, but I don't see the use of lineterminator in the
> >case statement anywhere.)
>
> One thing to bear in mind is the history of the CSV module - it
> dates back to Python 1.5 times, when python didn't have universal
> newline support.

I see. Even so -- I don't see how universal newline support is needed
for this...?

> If I remember correctly, lineterminator is only used when generating CSV
> output, not when parsing input. On input, the value of lineterminator
> is ignored, and \r and \n are hard-coded.
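
The read/write asymmetry described in the quoted paragraph can be checked with a short script. This is only a sketch against a current Python 3 `csv` module, not the 2004 code under discussion; the sample rows and the `'|'` terminator are made up for illustration.

```python
import csv
import io

# Write two rows using '|' as the line terminator.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=":", lineterminator="|")
writer.writerow(["a", "b"])
writer.writerow(["c", "d"])
written = buf.getvalue()

# Read the same text back with the same settings. The reader only
# splits records on '\r' and '\n', so '|' stays inside the fields.
rows = list(csv.reader([written], delimiter=":", lineterminator="|"))

print(repr(written))
print(rows)
```

The writer honours `lineterminator`, while the reader returns everything as a single record, which is the round-trip problem raised in the reply that follows.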
Oh -- how unfortunate :]

Is this documented in the PEP/standard docs? I've just browsed them,
but couldn't find the distinction between parameters that affect
reading and those affecting writing. To quote the PEP:

  "In addition to the dialect argument, both the reader and writer
  constructors take several specific formatting parameters, specified
  as keyword parameters."

One of the parameters listed under this (which, then, applies to the
reader) is:

  "lineterminator specifies the character sequence which should
  terminate rows."

It seems highly natural to me that reader and writer should be
completely symmetrical here -- i.e. you should *definitely* be able to
read back your own output, using the same Dialect (IMO). (I do see
something hinting at this problem in item 5 of the issue list,
though.)

I guess I had my eyes crossed when I did my experiment with
lineterminator set to '|' -- I thought it worked when reading, but
you're right -- it doesn't.

In other words, a potential patch should probably also add support for
parsing arbitrary line terminators -- or?

It could, of course, be that I should simply write the parsing code
into my own projects in Python. It just seems a shame not to use the
csv module when it exists. It seems to sit on the brink of generality,
just a tad biased toward the Microsoft dialect (which was, I gather,
part of the original design goals).

--
Magnus Lie Hetland "The mind is not a vessel to be filled,
http://hetland.org but a fire to be lighted." [Plutarch]

From magnus at hetland.org Mon Mar 15 09:09:45 2004
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Mon, 15 Mar 2004 09:09:45 +0100
Subject: [Csv] Thoughts about a patch
In-Reply-To: <20040314232158.7C7543C0BA@coffee.object-craft.com.au>
References: <20040313154808.GA8421@idi.ntnu.no>
    <20040314232158.7C7543C0BA@coffee.object-craft.com.au>
Message-ID: <20040315080945.GC16067@idi.ntnu.no>

Andrew McNamara :
>
> >Don't go all-out on this.
> >Simply interpret '\\\n' as '\n', just like
> >we interpret '\\:' as ':' (if ':' is the field separator). After all,
> >'\n' (or, in general, the record separator) is just as much a special
> >character in need of quoting as the other three (escape, delimiter,
> >and quote character).
>
> I guess that sounds reasonable.

OK. Now, this applies to reading, so it would imply making
lineterminator work for readers as well.

> It's often very difficult to make changes to code that is in the
> standard distribution - there always seems to be someone relying on
> the previous behaviour... 8-)

Yes, indeed. I've been thinking about that. Perhaps there should be
some flag or mode or something that decides how things work? For
example, there could be a "compatibility" flag that is True by
default; or there could be an "ESCAPE_ONLY" value for quoting... Or
even separate functions or a separate submodule... I don't know.

It seems that, perhaps, even though this is a relatively minor issue,
it might warrant a PEP...?

> You might want to make sure that, inside quotes, the special meaning
> of the escape character is removed (on the basis that Excel uses
> quotes exclusively (no quote character).

Hm. How about a quoted field like this, then?

  "Foo bar \" baz"

With '"' as quotechar and '\\' as escapechar. Wouldn't it be natural
to allow this, and to interpret '\\"' as '"'? I mean, if you *didn't*
want this behavior, you'd set escapechar to None -- or?

> However - I suspect we didn't get this right, and still honour the
> escape within a quoted string - if you find that we still honour the
> escape within a quoted string, your change should too (to remain
> consistent).

I'm not sure exactly how you mean it should behave. I understand that,
for example

  "foo \, bar"

should become

  ['foo \\, bar']

and not

  ['foo , bar']

But still,

  "foo \" bar"

should become

  ['foo " bar']

in my opinion. Don't you agree? However, as it is, "foo \, bar" is
interpreted as ['foo , bar'].
It almost seems like this should be dialect-dependent -- but, then
again, lots of interacting parameters is a recipe for (combinatorial)
disaster. (And the vagueness and complexity of the Microsoft CSV
dialect isn't helping :)

> Did that make any sense?

Sure. I think the core issue, IMO, is what the escape character really
means, and whether that meaning can be constant or whether it must
depend on something else.

OTOH: It could be possible to say that the behavior when using quoting
*and* an escape character together is undefined -- that quoting and
escaping are two mutually exclusive ways of dealing with separators
(both field and record (i.e. line) separators) in fields.

Does that seem reasonable? One could even issue a warning if the user
has quotechar and escapechar set at the same time, maybe? Then we'd
get away from the pesky interactions between the two... (Similar
warnings would apply to doublequote, of course.)

And the behavior of the escape character, when quotes are out of the
picture, could be defined as something like: "when preceding either
separator, lineterminator or escapechar, the escapechar is removed and
the separator/lineterminator/escapechar is included verbatim in the
field."

There would still be two remaining issues, however:

1. How should an escapechar preceding some *other* character be
   interpreted? The most backward-compatible approach would simply be
   to include the escape character verbatim -- but then escaping the
   escape character becomes redundant. It would also make it hard to
   interpret special sequences such as \n or \t for the client code,
   because the backslash in these sequences would end up at the same
   "escape level" as the \\. For example,

     foo \\n bar \n

   would be read in as "foo \n bar \n" -- and the client code couldn't
   tell the two apart. Not good.

2. Is it really okay for an escape character to escape a
   multi-character sequence? If it is to escape the lineterminator, it
   must work for multi-character sequences such as '\r\n'.
   This *might* lead to confusion, as the convention for escape
   characters is to escape only the following character.

A possibility is to let the escape character mean "reproduce the
following character verbatim and remove me, no matter what". Then '\n'
and '\t' would simply mean 'n' and 't' -- possibly surprising -- and
each character in the line terminator would be escaped separately.

Oh, well. Maybe I should just go with XML after all.

--
Magnus Lie Hetland "The mind is not a vessel to be filled,
http://hetland.org but a fire to be lighted." [Plutarch]

From magnus at hetland.org Mon Mar 15 09:20:45 2004
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Mon, 15 Mar 2004 09:20:45 +0100
Subject: [Csv] Thoughts about a patch
In-Reply-To: <20040315080945.GC16067@idi.ntnu.no>
References: <20040313154808.GA8421@idi.ntnu.no>
    <20040314232158.7C7543C0BA@coffee.object-craft.com.au>
    <20040315080945.GC16067@idi.ntnu.no>
Message-ID: <20040315082045.GA19927@idi.ntnu.no>

Magnus Lie Hetland :
[snip]
> Does that seem reasonable? One could even issue a warning if the user
> has quotechar and escapechar set at the same time, maybe? Then we'd
> get away from the pesky interactions between the two... (Similar
> warnings would apply to doublequote, of course.)
>
> And the behavior of the escape character, when quotes are out of the
> picture, could be defined as something like: "when preceding either
> separator, lineterminator or escapechar, the escapechar is removed and
> the separator/lineterminator/escapechar is included verbatim in the
> field."

A possible "strictification" would be to disallow the use of the
escape character *except* in the places where it has an obvious
meaning (in front of a field/record separator or another escape char).

Still not sure how to escape with multi-character line terminators,
though.

--
Magnus Lie Hetland "The mind is not a vessel to be filled,
http://hetland.org but a fire to be lighted." [Plutarch]
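
The escape-only rule floated in this thread ("reproduce the following character verbatim and remove me, no matter what", with no quoting at all) is small enough to sketch in pure Python. The function below is an illustration only: `parse_escaped` and its parameters are hypothetical names, it is not part of the csv module, and it is not the patch under discussion. It also accepts an arbitrary, possibly multi-character, line terminator when reading.

```python
def parse_escaped(text, delimiter=":", escapechar="\\", lineterminator="\n"):
    """Split text into records of fields using escaping only (no quoting).

    Hypothetical helper: an escape character is dropped and the single
    character after it is kept verbatim, whatever that character is.
    """
    records, fields, field = [], [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == escapechar and i + 1 < len(text):
            # Drop the escape character, keep the next character as data.
            field.append(text[i + 1])
            i += 2
        elif text.startswith(lineterminator, i):
            # Unescaped line terminator: close the field and the record.
            fields.append("".join(field))
            records.append(fields)
            fields, field = [], []
            i += len(lineterminator)
        elif ch == delimiter:
            # Unescaped delimiter: close the current field.
            fields.append("".join(field))
            field = []
            i += 1
        else:
            field.append(ch)
            i += 1
    # Flush a final record that has no trailing line terminator.
    if field or fields:
        fields.append("".join(field))
        records.append(fields)
    return records


# The example from the start of the thread: an escaped newline in a field.
passwd_like = parse_escaped("foo:bar:baz\\\nfrozz:bozz\n")
# A custom record separator, with one escaped occurrence inside a field.
piped = parse_escaped("a\\|b:c|d:e", lineterminator="|")
print(passwd_like)
print(piped)
```

Under this rule a multi-character terminator such as '\r\n' has to be escaped one character at a time, and '\n' written as backslash-n in the data comes out as a plain 'n', which are exactly the two surprises pointed out above.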