From skip at pobox.com Wed Dec 28 17:15:25 2005
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 28 Dec 2005 10:15:25 -0600
Subject: [Csv] Sniffer empty delimiter
Message-ID: <17330.47645.32970.405332@montanaro.dyndns.org>

In this bug report:

    http://python.org/sf/1157169

Neil Schemenauer reports a problem with this code:

    >>> d = csv.Sniffer().sniff('abc', ['\t', ','])
    >>> csv.reader(['abc'], d)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: bad argument type for built-in operation

In his Sniffer case it is clear that neither TAB nor comma is an explicit
delimiter. It's also not clear what the delimiter is. The generated
dialect has a resulting empty delimiter. I can see three possible remedies:

    1. raise csv.Error from Sniffer.sniff

    2. return comma as the "standard" delimiter or because the sample
       appears to only have a single comma

    3. return TAB as it's first in the delimiters list.

I'm sure there are other candidates ("b" because it separates "a" and "c"?)

Any thoughts about the best "remedy" to this problem? It's clear that
letting the empty delimiter escape into the wild is a problem.

Skip

From sjmachin at lexicon.net Wed Dec 28 23:24:15 2005
From: sjmachin at lexicon.net (John Machin)
Date: Thu, 29 Dec 2005 09:24:15 +1100
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <17330.47645.32970.405332@montanaro.dyndns.org>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
Message-ID: <43B3108F.2010403@lexicon.net>

skip at pobox.com wrote:
> In this bug report:
>
>     http://python.org/sf/1157169
>
> Neil Schemenauer reports a problem with this code:
>
>     >>> d = csv.Sniffer().sniff('abc', ['\t', ','])
>     >>> csv.reader(['abc'], d)
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     TypeError: bad argument type for built-in operation
>
> In his Sniffer case it is clear that neither TAB nor comma is an explicit
> delimiter. It's also not clear what the delimiter is. The generated
> dialect has a resulting empty delimiter.
> I can see three possible remedies:
>
>     1. raise csv.Error from Sniffer.sniff
>
>     2. return comma as the "standard" delimiter or because the sample
>        appears to only have a single comma
>
>     3. return TAB as it's first in the delimiters list.
>
> I'm sure there are other candidates ("b" because it separates "a" and "c"?)
>
> Any thoughts about the best "remedy" to this problem? It's clear that
> letting the empty delimiter escape into the wild is a problem.
>
> Skip

Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> d = csv.Sniffer().sniff('a|b|c|d|e', ['\t', ','])
>>> d.delimiter
''
>>> d = csv.Sniffer().sniff('a|b|c|d|e')
>>> d.delimiter
'a'
>>>

Skip,

Some thoughts:

(1) IMHO it should *NEVER* return an alphabetic or numeric character as
the delimiter.

(2) If there is insufficient sample to determine the dialect's
attributes, then it shouldn't pluck them out of the air, with no
indication to the caller that there might be a problem. IOW I don't like
the "remedies" of "return standard delimiter" and "return first
delimiter". It should raise csv.Error; the discerning caller can then
take appropriate action.

(3) Some documentation on how the 2nd arg is used would be a good idea,
as would be an explanation of the relationship with the undocumented
"preferred" attribute:
>>> csv.Sniffer().preferred
[',', '\t', ';', ' ', ':']
>>>

(4) Too late to change now, but having a class with no args to its
constructor and only one other method has a whiff of some other language :-)

(5) But the doco is not correct, there are 2 non-constructor methods:
>>> csv.Sniffer().has_header("x")
True

Cheers,
John

From skip at pobox.com Thu Dec 29 01:07:50 2005
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 28 Dec 2005 18:07:50 -0600
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <43B3108F.2010403@lexicon.net>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
 <43B3108F.2010403@lexicon.net>
Message-ID: <17331.10454.131254.852426@montanaro.dyndns.org>

    Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import csv
    >>> d = csv.Sniffer().sniff('a|b|c|d|e', ['\t', ','])
    >>> d.delimiter
    ''
    >>> d = csv.Sniffer().sniff('a|b|c|d|e')
    >>> d.delimiter
    'a'

Both of these seem wrong to me at some level. I tend to agree with you that
if the delimiter fails it should raise an exception, certainly if the
delimiters argument defines a set of characters from which the actual
delimiter must be chosen (does it?). The second has to be considered a bug
doesn't it?

    John> (1) IMHO it should *NEVER* return an alphabetic or numeric
    John> character as the delimiter.

Probably a good rule of thumb.

    John> (2) If there is insufficient sample to determine the dialect's
    John> attributes, then it shouldn't pluck them out of the air, with
    John> no indication to the caller that there might be a problem. IOW
    John> I don't like the "remedies" of "return standard delimiter" and
    John> "return first delimiter". It should raise csv.Error; the
    John> discerning caller can then take appropriate action.

If I have a csv file that happens to only have one column and I'm using the
sniffer (presumably because I have an app that processes somewhat arbitrary
csv files) I'd hate for it to fail in that one case. For that case maybe we
can define an optional default arg that is a single character. Failing all
other tests, the default is returned.

    John> (3) Some documentation on how the 2nd arg is used would be a good
    John> idea, as would be an explanation of the relationship with the
    John> undocumented "preferred" attribute:

Agreed. I seem to recall you're the author. Got some text?

    >>> csv.Sniffer().preferred
    [',', '\t', ';', ' ', ':']

    John> (4) Too late to change now, but having a class with no args to its
    John> constructor and only one other method has a whiff of some
    John> other language :-)

It's not too late to add an optional preferred arg to the constructor.

    John> (5) But the doco is not correct, there are 2 non-constructor
    John> methods:

Yeah, I already noticed and fixed that. That was easy. ;-)

Skip

From fdrake at acm.org Thu Dec 29 03:19:49 2005
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 28 Dec 2005 21:19:49 -0500
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <17331.10454.131254.852426@montanaro.dyndns.org>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
 <43B3108F.2010403@lexicon.net>
 <17331.10454.131254.852426@montanaro.dyndns.org>
Message-ID: <200512282119.50120.fdrake@acm.org>

On Wednesday 28 December 2005 19:07, skip at pobox.com wrote:
> arbitrary csv files) I'd hate for it to fail in that one case. For that
> case maybe we can define an optional default arg that is a single
> character. Failing all other tests, the default is returned.

The default shouldn't be type-checked (including string length), but should
simply be returned if provided. This allows the caller to determine the
significance of getting back the passed-in value.

I guess you could think of it as similar to the third argument of getattr().
:-)

  -Fred

--
Fred L. Drake, Jr.

From skip at pobox.com Thu Dec 29 05:59:14 2005
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 28 Dec 2005 22:59:14 -0600
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <200512282119.50120.fdrake@acm.org>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
 <43B3108F.2010403@lexicon.net>
 <17331.10454.131254.852426@montanaro.dyndns.org>
 <200512282119.50120.fdrake@acm.org>
Message-ID: <17331.27938.293758.849779@montanaro.dyndns.org>

    >> For that case maybe we can define an optional default arg that is a
    >> single character. Failing all other tests, the default is returned.

    Fred> The default shouldn't be type-checked (including string length),
    Fred> but should simply be returned if provided. This allows the caller
    Fred> to determine the significance of getting back the passed-in value.

Hmmm... To preserve current (incorrect?) behavior I think the default almost
has to be "".
To be useful though, it has to be a single-character string given the
current limitations of the module.

Skip

From fdrake at acm.org Thu Dec 29 07:23:28 2005
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 29 Dec 2005 01:23:28 -0500
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <17331.27938.293758.849779@montanaro.dyndns.org>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
 <200512282119.50120.fdrake@acm.org>
 <17331.27938.293758.849779@montanaro.dyndns.org>
Message-ID: <200512290123.29482.fdrake@acm.org>

On Wednesday 28 December 2005 23:59, skip at pobox.com wrote:
> Hmmm... To preserve current (incorrect?) behavior I think the default
> almost has to be "". To be useful though, it has to be a single-character
> string given the current limitations of the module.

That's a reasonable requirement for a delimiter used for parsing, and I'm not
suggesting that that not be a requirement for that. But if it's a marker
object so the caller can determine that no delimiter was determined, then
it's still up to the caller to check for that and either not parse or deal
with it some other way (ask the user, for instance).

I'm not sure it's a big deal, but that's my thought on the matter at any
rate. ;-)

  -Fred

--
Fred L. Drake, Jr.
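
Fred's getattr() analogy could be sketched roughly as below. This is only an
illustration under stated assumptions: the wrapper name sniff_or_default and
the _MISSING sentinel are hypothetical and not part of the csv module. The
wrapper treats both a csv.Error and an empty delimiter (the 2.4 behaviour
shown in this thread) as "no delimiter found", and hands back whatever
default the caller supplied without checking its type.

```python
import csv

# Hypothetical sentinel: lets callers pass any default, including None or "".
_MISSING = object()

def sniff_or_default(sample, delimiters=None, default=_MISSING):
    """Sniff a dialect; on failure return `default` unchecked, like getattr()."""
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters)
        if not dialect.delimiter:
            # Older versions could let an empty delimiter escape.
            raise csv.Error("sniffer produced an empty delimiter")
    except csv.Error:
        if default is _MISSING:
            raise  # no default given: let the caller see the failure
        return default
    return dialect

# The caller decides what the marker means (ask the user, skip parsing, ...).
result = sniff_or_default("abc", ["\t", ","], default=None)
if result is None:
    print("no delimiter found; not parsing")
```

Leaving the default unchecked keeps the decision with the caller, which is
the point of the analogy: the returned marker only signals that sniffing
failed and is not itself meant to be used as a delimiter.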

From sjmachin at lexicon.net Thu Dec 29 08:25:28 2005
From: sjmachin at lexicon.net (John Machin)
Date: Thu, 29 Dec 2005 18:25:28 +1100
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <17331.10454.131254.852426@montanaro.dyndns.org>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
 <43B3108F.2010403@lexicon.net>
 <17331.10454.131254.852426@montanaro.dyndns.org>
Message-ID: <43B38F68.9030603@lexicon.net>

skip at pobox.com wrote:
>     Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32
>     Type "help", "copyright", "credits" or "license" for more information.
>     >>> import csv
>     >>> d = csv.Sniffer().sniff('a|b|c|d|e', ['\t', ','])
>     >>> d.delimiter
>     ''
>     >>> d = csv.Sniffer().sniff('a|b|c|d|e')
>     >>> d.delimiter
>     'a'
>
> Both of these seem wrong to me at some level. I tend to agree with you that
> if the delimiter fails it should raise an exception, certainly if the
> delimiters argument defines a set of characters from which the actual
> delimiter must be chosen (does it?).

I've got no idea what the delimiters argument is for. That's why I
suggested it be documented. Contrary to your recollection, I am *not*
the author of any part of the csv module.

> The second has to be considered a bug
> doesn't it?

Yes. I regard the notion of an alphanumeric character being a delimiter
as utterly preposterous.

>
>     John> (1) IMHO it should *NEVER* return an alphabetic or numeric
>     John> character as the delimiter.
>
> Probably a good rule of thumb.
>
>     John> (2) If there is insufficient sample to determine the dialect's
>     John> attributes, then it shouldn't pluck them out of the air, with
>     John> no indication to the caller that there might be a problem. IOW
>     John> I don't like the "remedies" of "return standard delimiter" and
>     John> "return first delimiter". It should raise csv.Error; the
>     John> discerning caller can then take appropriate action.
>
> If I have a csv file that happens to only have one column and I'm using the
> sniffer (presumably because I have an app that processes somewhat arbitrary
> csv files) I'd hate for it to fail in that one case. For that case maybe we
> can define an optional default arg that is a single character. Failing all
> other tests, the default is returned.

Optional default arg *plus* an exception? Holy redundancy, Batman!

Caller can do this:

try:
    d = csv.Sniffer().sniff(sample)
except csv.Error:
    d = my_default_dialect

>
>     John> (3) Some documentation on how the 2nd arg is used would be a good
>     John> idea, as would be an explanation of the relationship with the
>     John> undocumented "preferred" attribute:
>
> Agreed. I seem to recall you're the author. Got some text?

Not so. In fact I'd not even used the sniffer before today.

>
>     >>> csv.Sniffer().preferred
>     [',', '\t', ';', ' ', ':']
>
>     John> (4) Too late to change now, but having a class with no args to its
>     John> constructor and only one other method has a whiff of some
>     John> other language :-)
>
> It's not too late to add an optional preferred arg to the constructor.

Maybe it's even not too late to get some feedback from the actual users and
to spec out the sniffer a bit more rigorously and then ensure it meets
that spec.

Cheers,
John

From skip at pobox.com Thu Dec 29 14:32:47 2005
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 29 Dec 2005 07:32:47 -0600
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <43B38F68.9030603@lexicon.net>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
 <43B3108F.2010403@lexicon.net>
 <17331.10454.131254.852426@montanaro.dyndns.org>
 <43B38F68.9030603@lexicon.net>
Message-ID: <17331.58751.825698.196812@montanaro.dyndns.org>

    John> Contrary to your recollection, I am *not* the author of any part
    John> of the csv module.

Ah, sorry about that. I'm not the sniffer's author (never used it in fact).

    John> Optional default arg *plus* an exception? Holy redundancy, Batman!

Well, yeah, it's an either/or sort of thing. I was thinking out loud.

    John> Caller can do this:

    John> try:
    John>     d = csv.Sniffer().sniff(sample)
    John> except csv.Error:
    John>     d = my_default_dialect

Yeah, but today that code would be written

    d = csv.Sniffer().sniff(sample)
    try:
        rdr = csv.reader(f, d)
    except TypeError:
        blah blah blah

so there's a backwards compatibility problem since the exception is raised
by the reader class, not the sniffer.

    John> (3) Some documentation on how the 2nd arg is used would be a good
    John> idea, as would be an explanation of the relationship with the
    John> undocumented "preferred" attribute:
    >>
    >> Agreed. I seem to recall you're the author. Got some text?

    John> Not so. In fact I'd not even used the sniffer before today.

Unfortunately, neither have I.

    John> Maybe it's even not too late to get some feedback from the actual
    John> users and to spec out the sniffer a bit more rigorously and then
    John> ensure it meets that spec.

That sounds good as well. If the API is going to change, might as well
change it in a useful, non-speculative direction.

Skip

From skip at pobox.com Fri Dec 30 06:19:28 2005
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 29 Dec 2005 23:19:28 -0600
Subject: [Csv] improvement(?) to Sniffer._guess_delimiter()
Message-ID: <17332.50016.909334.960552@montanaro.dyndns.org>

I just checked in a change to csv.py (svn revision 41849). Previously, the
sniffer returned "a" as the delimiter for this sample

    a|b|c\r\nd|e|f\r\n

Now it correctly returns "|", but I don't know if my code is any better than
the original. The description of what _guess_delimiter() does (which I don't
really understand) is in its doc string. The key change is in the "punt"
section of the code at the end. All other attempts to select a delimiter have
failed.
I just punt differently than the original code:

-        # finally, just return the first damn character in the list
-        delim = delims.keys()[0]
+        # nothing else indicates a preference, pick the character that
+        # dominates(?)
+        items = [(v,k) for (k,v) in delims.items()]
+        items.sort()
+        delim = items[-1][1]

Is my change an actual improvement or just serendipity?

Thx,

Skip
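
The revised punt step in the diff can be sketched on its own in current
Python. This is a minimal illustration, not the csv module's code: the
function name pick_dominant is hypothetical, and delims is assumed to map
each candidate character to a sortable score (a plain count here). The old
line took whichever key the dictionary happened to list first; the new one
sorts (score, character) pairs and takes the last.

```python
def pick_dominant(delims):
    """Return the candidate with the highest score (hypothetical helper).

    Ties are broken by the larger character, because the pairs are
    sorted as (score, character) tuples and the last one is taken.
    """
    if not delims:
        raise ValueError("no candidate delimiters")
    # Equivalent to the diff: build (value, key) pairs, sort, take the last.
    items = sorted((score, char) for char, score in delims.items())
    return items[-1][1]

# Assumed scores for illustration only: "|" dominates the letters.
candidates = {"a": 1, "|": 2, "b": 1}
print(pick_dominant(candidates))  # prints |
```

Whether the highest score is the right notion of "dominates" is exactly the
question Skip raises; the sketch only shows that the choice no longer depends
on dictionary ordering.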