From andrewm at object-craft.com.au Wed Jan 5 08:06:43 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Jan 2005 18:06:43 +1100
Subject: [Csv] csv module TODO list
Message-ID: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>

There's a bunch of jobs we (CSV module maintainers) have been putting off - attached is a list (in no particular order):

* unicode support (this will probably uglify the code considerably).

* 8-bit transparency (specifically, allow \0 characters in the source string and as delimiters, etc).

* Reader and universal newlines don't interact well; the reader doesn't honour the Dialect's lineterminator setting. All outstanding bug IDs (789519, 944890, 967934 and 1072404) are related to this - it's a difficult problem and further discussion is needed.

* Compare PEP-305 and the library reference manual to the module as implemented and either document the differences or correct them.

* Address or document Francis Avila's issues as mentioned in this posting: http://www.google.com.au/groups?selm=vsb89q1d3n5qb1%40corp.supernews.com

* Several blogs complain that the CSV module is no good for parsing strings. Suggest making it clearer in the documentation that the reader accepts an iterable, rather than a file, and document why an iterable (as opposed to a string) is necessary (multi-line records with embedded newlines). We could also provide an interface that parses a single string (or the old Object Craft interface) for those that really feel the need. See:
  http://radio.weblogs.com/0124960/2003/09/12.html
  http://zephyrfalcon.org/weblog/arch_d7_2003_09_06.html#e335

* Compatibility API for the old Object Craft CSV module? http://mechanicalcat.net/cgi-bin/log/2003/08/18 For example: "from csv.legacy import reader" or something.

* Pure Python implementation?

* Some CSV-like formats consider a quoted field a string, and an unquoted field a number - consider supporting this in the Reader and Writer. See: http://radio.weblogs.com/0124960/2004/04/23.html

* Add line number and record number counters to the reader object?

* It's possible to get the csv parser to suck the whole source file into memory with an unmatched quote character. Need to limit the size of the internal buffer.

Also, review comments from Neal Norwitz, 22 Mar 2003 (some of these should already have been addressed):

* remove TODO comment at top of file--it's empty
* is CSV going to be maintained outside the python tree? If not, remove the 2.2 compatibility macros for: PyDoc_STR, PyDoc_STRVAR, PyMODINIT_FUNC, etc.
* inline the following functions since they are used only in one place: get_string, set_string, get_nullchar_as_None, set_nullchar_as_None, join_reset (maybe)
* rather than use PyErr_BadArgument, should you use assert? (first example, Dialect_set_quoting, line 218)
* is it necessary to have Dialect_methods, can you use 0 for tp_methods?
* remove commented out code (PyMem_DEL) on line 261. Have you used valgrind on the test to find memory overwrites/leaks?
* PyString_AsString()[0] on line 331 could return NULL in which case you are dereferencing a NULL pointer
* not sure why there are casts on 0 pointers, lines 383-393, 733-743, 1144-1154, 1164-1165
* Reader_getiter() can be removed and use PyObject_SelfIter()
* I think you need PyErr_NoMemory() before returning on line 768, 1178
* is PyString_AsString(self->dialect->lineterminator) on line 994 guaranteed not to return NULL? If not, it could crash by passing to memmove.
* PyString_AsString() can return NULL on line 1048 and 1063, the result is passed to join_append() * iteratable should be iterable? (line 1088) * why doesn't csv_writerows() have a docstring? csv_writerow does * any PyUnicode_* methods should be protected with #ifdef Py_USING_UNICODE * csv_unregister_dialect, csv_get_dialect could use METH_O so you don't need to use PyArg_ParseTuple * in init_csv, recommend using PyModule_AddIntConstant and PyModule_AddStringConstant where appropriate Also, review comments from Jeremy Hylton, 10 Apr 2003: I've been reviewing extension modules looking for C types that should participate in garbage collection. I think the csv ReaderObj and WriterObj should participate. The ReaderObj it contains a reference to input_iter that could be an arbitrary Python object. The iterator object could well participate in a cycle that refers to the ReaderObj. The WriterObj has a reference to a writeline callable, which could well be a method of an object that also points to the WriterObj. The Dialect object appears to be safe, because the only PyObject * it refers should be a string. Safe until someone creates an insane string subclass <0.4 wink>. Also, an unrelated comment about the code, the lineterminator of the Dialect is managed by a collection of little helper functions like get_string, set_string, etc. This code appears to be excessively general; since they're called only once, it seems clearer to inline the logic directly in the get/set methods for the lineterminator. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Wed Jan 5 08:33:04 2005 From: skip at pobox.com (Skip Montanaro) Date: Wed, 5 Jan 2005 01:33:04 -0600 Subject: [Csv] csv module TODO list In-Reply-To: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> Message-ID: <16859.38960.9935.682429@montanaro.dyndns.org> Andrew> There's a bunch of jobs we (CSV module maintainers) have been Andrew> putting off - attached is a list (in no particular order): ... In addition, it occurred to me this evening that there's functionality in the csv module I don't think anybody uses. For example, you can register CSV dialects by name, then pass in the string name instead of the dialect class. I'd be in favor of scrapping list_dialects, register_dialect and unregister_dialect altogether. While they are probably trivial little functions I don't think they add much if anything to the implementation and just complicate the _csv extension module slightly. I'm also not aware that anyone really uses the Sniffer class, though it does provide some useful functionality should you need to analyze random CSV files. Skip From andrewm at object-craft.com.au Wed Jan 5 08:55:06 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 05 Jan 2005 18:55:06 +1100 Subject: [Python-Dev] Re: [Csv] csv module TODO list In-Reply-To: <16859.38960.9935.682429@montanaro.dyndns.org> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <16859.38960.9935.682429@montanaro.dyndns.org> Message-ID: <20050105075506.314C93C8E5@coffee.object-craft.com.au> > Andrew> There's a bunch of jobs we (CSV module maintainers) have been > Andrew> putting off - attached is a list (in no particular order): > ... > >In addition, it occurred to me this evening that there's functionality in >the csv module I don't think anybody uses. 
It's very difficult to say for sure that nobody is using it once it's released to the world. >For example, you can register CSV dialects by name, then pass in the >string name instead of the dialect class. I'd be in favor of scrapping >list_dialects, register_dialect and unregister_dialect altogether. While >they are probably trivial little functions I don't think they add much if >anything to the implementation and just complicate the _csv extension >module slightly. Yes, in hindsight, they're not really necessary, although I'm sure we had some motivation for them initially. That said, they're there now, and they shouldn't require much maintenance. >I'm also not aware that anyone really uses the Sniffer class, though it >does provide some useful functionality should you need to analyze random >CSV files. The comment I get repeatedly is that they don't use it because it's "too magic/scary". That's as it should be. But if it didn't exist, then someone would be requesting we add it... 8-) -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Wed Jan 5 10:34:14 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 05 Jan 2005 20:34:14 +1100 Subject: [Csv] Re: [Python-Dev] csv module TODO list In-Reply-To: <41DBAF06.6020401@egenix.com> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com> Message-ID: <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> >> Andrew McNamara wrote: >>> There's a bunch of jobs we (CSV module maintainers) have been putting >>> off - attached is a list (in no particular order): >>> * unicode support (this will probably uglify the code considerably). >> >Martin v. L?wis wrote: >> Can you please elaborate on that? What needs to be done, and how is >> that going to be done? It might be possible to avoid considerable >> uglification. I'm not altogether sure there. The parsing state machine is all written in C, and deals with signed chars - I expect we'll need two versions of that (or one version that's compiled twice using pre-processor macros). Quite a large job. Suggestions gratefully received. M.-A. Lemburg wrote: >Indeed. The trick is to convert to Unicode early and to use Unicode >literals instead of string literals in the code. Yes, although it would be nice to also retain the 8-bit versions as well. >Note that the only real-life Unicode format in use is UTF-16 >(with BOM mark) written by Excel. Note that there's no standard >for specifying the encoding in CSV files, so this is also the only >feasable format. Yes - that's part of the problem I hadn't really thought about yet - the csv module currently interacts directly with files as iterators, but it's clear that we'll need to decode as we go. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From martin at v.loewis.de Wed Jan 5 09:39:44 2005 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 05 Jan 2005 09:39:44 +0100 Subject: [Csv] Re: [Python-Dev] csv module TODO list In-Reply-To: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> Message-ID: <41DBA7D0.80101@v.loewis.de> Andrew McNamara wrote: > There's a bunch of jobs we (CSV module maintainers) have been putting > off - attached is a list (in no particular order): > > * unicode support (this will probably uglify the code considerably). Can you please elaborate on that? 
What needs to be done, and how is that going to be done? It might be possible to avoid considerable uglification. Regards, Martin From mal at egenix.com Wed Jan 5 10:10:30 2005 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 05 Jan 2005 10:10:30 +0100 Subject: [Csv] Re: [Python-Dev] csv module TODO list In-Reply-To: <41DBA7D0.80101@v.loewis.de> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <41DBA7D0.80101@v.loewis.de> Message-ID: <41DBAF06.6020401@egenix.com> Martin v. L?wis wrote: > Andrew McNamara wrote: > >> There's a bunch of jobs we (CSV module maintainers) have been putting >> off - attached is a list (in no particular order): >> * unicode support (this will probably uglify the code considerably). > > > Can you please elaborate on that? What needs to be done, and how is > that going to be done? It might be possible to avoid considerable > uglification. Indeed. The trick is to convert to Unicode early and to use Unicode literals instead of string literals in the code. Note that the only real-life Unicode format in use is UTF-16 (with BOM mark) written by Excel. Note that there's no standard for specifying the encoding in CSV files, so this is also the only feasable format. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jan 05 2005) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! :::: From mal at egenix.com Wed Jan 5 10:44:40 2005 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 05 Jan 2005 10:44:40 +0100 Subject: [Csv] Re: [Python-Dev] csv module TODO list In-Reply-To: <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com> <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> Message-ID: <41DBB708.5030501@egenix.com> Andrew McNamara wrote: >>>Andrew McNamara wrote: >>> >>>>There's a bunch of jobs we (CSV module maintainers) have been putting >>>>off - attached is a list (in no particular order): >>>>* unicode support (this will probably uglify the code considerably). >>> >>Martin v. L?wis wrote: >> >>>Can you please elaborate on that? What needs to be done, and how is >>>that going to be done? It might be possible to avoid considerable >>>uglification. > > > I'm not altogether sure there. The parsing state machine is all written in > C, and deals with signed chars - I expect we'll need two versions of that > (or one version that's compiled twice using pre-processor macros). Quite > a large job. Suggestions gratefully received. > > M.-A. Lemburg wrote: > >>Indeed. The trick is to convert to Unicode early and to use Unicode >>literals instead of string literals in the code. > > > Yes, although it would be nice to also retain the 8-bit versions as well. You can do so by using latin-1 as default encoding. Works great ! >>Note that the only real-life Unicode format in use is UTF-16 >>(with BOM mark) written by Excel. Note that there's no standard >>for specifying the encoding in CSV files, so this is also the only >>feasable format. 
> > Yes - that's part of the problem I hadn't really thought about yet - the > csv module currently interacts directly with files as iterators, but it's > clear that we'll need to decode as we go. Depends on your needs: CSV files tend to be small enough to do the decoding in one call in memory. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jan 05 2005) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! :::: From andrewm at object-craft.com.au Wed Jan 5 11:03:25 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 05 Jan 2005 21:03:25 +1100 Subject: [Csv] Re: [Python-Dev] csv module TODO list In-Reply-To: <41DBB708.5030501@egenix.com> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com> <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> <41DBB708.5030501@egenix.com> Message-ID: <20050105100325.A220D3C8E5@coffee.object-craft.com.au> >> Yes, although it would be nice to also retain the 8-bit versions as well. > >You can do so by using latin-1 as default encoding. Works great ! Yep, although that means we wear the cost of decoding and encoding for all 8 bit input. What does the _sre.c code do? >Depends on your needs: CSV files tend to be small enough >to do the decoding in one call in memory. We are routinely dealing with multi-gigabyte csv files - which is why the original 2001 vintage csv module was written as a C state machine. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From mal at egenix.com Wed Jan 5 11:16:50 2005 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 05 Jan 2005 11:16:50 +0100 Subject: [Csv] Re: [Python-Dev] csv module TODO list In-Reply-To: <20050105100325.A220D3C8E5@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com> <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> <41DBB708.5030501@egenix.com> <20050105100325.A220D3C8E5@coffee.object-craft.com.au> Message-ID: <41DBBE92.4070106@egenix.com> Andrew McNamara wrote: >>>Yes, although it would be nice to also retain the 8-bit versions as well. >> >>You can do so by using latin-1 as default encoding. Works great ! > > Yep, although that means we wear the cost of decoding and encoding for > all 8 bit input. Right, but it makes the code very clean and straight forward. Again, it depends on what you need. If performance is critical then you probably need a C version written using the same trick as _sre.c... > What does the _sre.c code do? It comes in two versions: one for 8-bit the other for Unicode. >>Depends on your needs: CSV files tend to be small enough >>to do the decoding in one call in memory. > > We are routinely dealing with multi-gigabyte csv files - which is why the > original 2001 vintage csv module was written as a C state machine. I see, but are you sure that the typical Python user will have the same requirements to make it worth the effort (and complexity) ? 
I've written a few CSV parsers and writers myself over the years and the requirements were different every time, in terms of being flexible in the parsing phase, the interfaces and the performance needs. Haven't yet found a one fits all solution and don't really expect to any more :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jan 05 2005) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! :::: From andrewm at object-craft.com.au Wed Jan 5 11:33:05 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 05 Jan 2005 21:33:05 +1100 Subject: [Csv] Re: [Python-Dev] csv module TODO list In-Reply-To: <41DBBE92.4070106@egenix.com> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com> <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> <41DBB708.5030501@egenix.com> <20050105100325.A220D3C8E5@coffee.object-craft.com.au> <41DBBE92.4070106@egenix.com> Message-ID: <20050105103305.AD80B3C8E5@coffee.object-craft.com.au> >> Yep, although that means we wear the cost of decoding and encoding for >> all 8 bit input. > >Right, but it makes the code very clean and straight forward. I agree it makes for a very clean solution, and 99% of the time I'd chose that option. >Again, it depends on what you need. If performance is critical >then you probably need a C version written using the same trick >as _sre.c... > >> What does the _sre.c code do? > >It comes in two versions: one for 8-bit the other for Unicode. That's what I thought. I think the motivations here are similar to those that drove the _sre developers. >> We are routinely dealing with multi-gigabyte csv files - which is why the >> original 2001 vintage csv module was written as a C state machine. > >I see, but are you sure that the typical Python user will have >the same requirements to make it worth the effort (and >complexity) ? This is open source, so I scratch my own itch (and that of my employers) - we need fast csv parsing more than we need unicode... 8-) Okay, assuming we go the "produce two versions via evil macro tricks" path, it's still not quite the same situation as _sre.c, which only has to deal with the internal unicode representation. One way to approach this would be to add an "encoding" keyword argument to the readers and writers. If given, the parser would decode the input stream to the internal representation before passing it through the unicode state machine, which would yield tuples of unicode objects. That leaves us with a bit of a problem where the source is already unicode (eg, a list of unicode strings)... hmm. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From sjmachin at lexicon.net Wed Jan 5 11:41:19 2005 From: sjmachin at lexicon.net (sjmachin at lexicon.net) Date: Wed, 05 Jan 2005 21:41:19 +1100 Subject: [Csv] Re: [Python-Dev] csv module TODO list In-Reply-To: <41DBB708.5030501@egenix.com> References: <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> Message-ID: <41DC5EFF.28236.36EF6D7@localhost> On 5 Jan 2005 at 10:44, M.-A. Lemburg wrote: > > Depends on your needs: CSV files tend to be small enough > to do the decoding in one call in memory. 
>

The CSV format is often used for exchanging large data files, not just for spreadsheet output.

My experience: files with over a million rows are not uncommon. FWIW, no Unicode.

My (jaundiced, but based on experience) viewpoint on newlines inside quoted strings:

Prob (spreadsheet file with newlines inside data fields) = 0.001
Prob (some programmer has not quoted their quotes properly) = 0.999

Hence I suggest an option to specify this as a bug.

Regards,
John

From andrewm at object-craft.com.au Wed Jan 5 12:08:49 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Jan 2005 22:08:49 +1100
Subject: [Csv] csv module TODO list
In-Reply-To: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
Message-ID: <20050105110849.CBA843C8E5@coffee.object-craft.com.au>

>Also, review comments from Neal Norwitz, 22 Mar 2003 (some of these should
>already have been addressed):

I should apologise to Neal here for not replying to him at the time. Okay, going through the issues Neal raised...

>* remove TODO comment at top of file--it's empty

Was fixed.

>* is CSV going to be maintained outside the python tree?
>  If not, remove the 2.2 compatibility macros for:
>  PyDoc_STR, PyDoc_STRVAR, PyMODINIT_FUNC, etc.

Does anyone think we should continue to maintain this 2.2 compatibility?

>* inline the following functions since they are used only in one place
>  get_string, set_string, get_nullchar_as_None, set_nullchar_as_None,
>  join_reset (maybe)

It was done that way as I felt we would be adding more getters and setters to the dialect object in future.

>* rather than use PyErr_BadArgument, should you use assert?
>  (first example, Dialect_set_quoting, line 218)

You mean C assert()? I don't think I'm really following you here - where would the type of the object be checked in a way the user could recover from?

>* is it necessary to have Dialect_methods, can you use 0 for tp_methods?

I was assuming I would need to add methods at some point (in fact, I did have methods, but removed them).

>* remove commented out code (PyMem_DEL) on line 261
>  Have you used valgrind on the test to find memory overwrites/leaks?

No, valgrind wasn't used.

>* PyString_AsString()[0] on line 331 could return NULL in which case
>  you are dereferencing a NULL pointer

Was fixed.

>* not sure why there are casts on 0 pointers
>  lines 383-393, 733-743, 1144-1154, 1164-1165

To make it easier when the time comes to add one of those members.

>* Reader_getiter() can be removed and use PyObject_SelfIter()

Okay, wasn't aware of PyObject_SelfIter - will fix.

>* I think you need PyErr_NoMemory() before returning on line 768, 1178

The examples I looked at in the Python core didn't do this - are you sure? (now lines 832 and 1280).

>* is PyString_AsString(self->dialect->lineterminator) on line 994
>  guaranteed not to return NULL? If not, it could crash by
>  passing to memmove.
>* PyString_AsString() can return NULL on line 1048 and 1063,
>  the result is passed to join_append()

Looking at the PyString_AsString implementation, it looks safe (we ensure it's really a string elsewhere)?

>* iteratable should be iterable? (line 1088)

Sorry, I don't know what you're getting at here? (now line 1162).

>* why doesn't csv_writerows() have a docstring? csv_writerow does

Was fixed.

>* any PyUnicode_* methods should be protected with #ifdef Py_USING_UNICODE

Was fixed.

>* csv_unregister_dialect, csv_get_dialect could use METH_O
>  so you don't need to use PyArg_ParseTuple

Was fixed.
>* in init_csv, recommend using > PyModule_AddIntConstant and PyModule_AddStringConstant > where appropriate Was fixed. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Wed Jan 5 12:14:02 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 05 Jan 2005 22:14:02 +1100 Subject: [Csv] Re: [Python-Dev] csv module TODO list In-Reply-To: <41DC5EFF.28236.36EF6D7@localhost> References: <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> <41DC5EFF.28236.36EF6D7@localhost> Message-ID: <20050105111402.A319C3C8E6@coffee.object-craft.com.au> >The CSV format is often used for exchanging large data files, not just for >spreadsheet output. > >My experience: files with over a million rows are not uncommon. FWIW, no >Unicode. Matches my experience also, but I suspect we both live in English speaking countries. Elsewhere in the world, the ratios could be reversed. There has also been some suggestion that the native string type in Python will become Unicode at some point in the future. >My (jaundiced, but based on experience) viewpoint on newlines inside >quoted strings: > >Prob (spreadsheet file with newlines inside data fields) = 0.001 > >Prob (some programmer has not quoted their quotes properly) = 0.999 > >Hence I suggest an option to specify this as a bug. I agree. What makes this extra exciting at the moment is that the CSV module will happily sit there slurping the whole file into memory trying to match a stray quote (of course, I only noticed this when trying to read a multi-gigabyte file). -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From mal at egenix.com Wed Jan 5 13:08:35 2005 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 05 Jan 2005 13:08:35 +0100 Subject: [Csv] Re: [Python-Dev] csv module TODO list In-Reply-To: <20050105111402.A319C3C8E6@coffee.object-craft.com.au> References: <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> <41DC5EFF.28236.36EF6D7@localhost> <20050105111402.A319C3C8E6@coffee.object-craft.com.au> Message-ID: <41DBD8C3.5090303@egenix.com> Andrew McNamara wrote: >>The CSV format is often used for exchanging large data files, not just for >>spreadsheet output. >> >>My experience: files with over a million rows are not uncommon. FWIW, no >>Unicode. > > Matches my experience also, but I suspect we both live in English speaking > countries. Elsewhere in the world, the ratios could be reversed. Hmm, wasn't XML intended to replace CSV (among other formats) for exchanging tons of data ;-) As I mentioned before, there's no such thing as the one fits all general CSV parser or writer. If Unicode CSV data is not common enough, you might want to provide a solution based on a UTF-8 string encoding - a decoder could convert the input stream to UTF-8, you then process that data using the existing CSV parser and then convert it back to Unicode in the .next() method. So far, I've only ever used Unicode CSV data for exchange with Asian language spreadsheets. > There has also been some suggestion that the native string type in Python > will become Unicode at some point in the future. Indeed :-) >>My (jaundiced, but based on experience) viewpoint on newlines inside >>quoted strings: >> >>Prob (spreadsheet file with newlines inside data fields) = 0.001 >> >>Prob (some programmer has not quoted their quotes properly) = 0.999 >> >>Hence I suggest an option to specify this as a bug. > > I agree. 
What makes this extra exciting at the moment is that the CSV > module will happily sit there slurping the whole file into memory trying > to match a stray quote (of course, I only noticed this when trying to > read a multi-gigabyte file). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jan 05 2005) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! :::: From magnus at hetland.org Wed Jan 5 13:19:21 2005 From: magnus at hetland.org (Magnus Lie Hetland) Date: Wed, 5 Jan 2005 13:19:21 +0100 Subject: [Csv] Re: csv module TODO list In-Reply-To: <20050105075506.314C93C8E5@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <16859.38960.9935.682429@montanaro.dyndns.org> <20050105075506.314C93C8E5@coffee.object-craft.com.au> Message-ID: <20050105121921.GB24030@idi.ntnu.no> Quite a while ago I posted some material to the csv-list about problems using the csv module on Unix-style colon-separated files -- it just doesn't deal properly with backslash escaping and is quite useless for this kind of file. I seem to recall the general view was that it wasn't intended for this kind of thing -- only the sort of csv that Microsoft Excel outputs/inputs, but if I am mistaken about this, perhaps fixing this issue might be put on the TODO-list? I'll be happy to re-send or summarize the relevant emails, if needed. -- Magnus Lie Hetland Fallen flower I see / Returning to its branch http://hetland.org Ah! a butterfly. [Arakida Moritake] From andrewm at object-craft.com.au Wed Jan 5 13:29:11 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 05 Jan 2005 23:29:11 +1100 Subject: [Csv] Re: csv module TODO list In-Reply-To: <20050105121921.GB24030@idi.ntnu.no> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <16859.38960.9935.682429@montanaro.dyndns.org> <20050105075506.314C93C8E5@coffee.object-craft.com.au> <20050105121921.GB24030@idi.ntnu.no> Message-ID: <20050105122911.83EE93C8E5@coffee.object-craft.com.au> >Quite a while ago I posted some material to the csv-list about >problems using the csv module on Unix-style colon-separated files -- >it just doesn't deal properly with backslash escaping and is quite >useless for this kind of file. I seem to recall the general view was >that it wasn't intended for this kind of thing -- only the sort of csv >that Microsoft Excel outputs/inputs, but if I am mistaken about this, >perhaps fixing this issue might be put on the TODO-list? I'll be happy >to re-send or summarize the relevant emails, if needed. 
I think a related issue was included in my TODO list: >* Address or document Francis Avila's issues as mentioned in this posting: > > http://www.google.com.au/groups?selm=vsb89q1d3n5qb1%40corp.supernews.com -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From martin at v.loewis.de Wed Jan 5 23:00:26 2005 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 05 Jan 2005 23:00:26 +0100 Subject: [Csv] Re: [Python-Dev] csv module TODO list In-Reply-To: <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com> <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> Message-ID: <41DC637A.5050105@v.loewis.de> Andrew McNamara wrote: >>>Can you please elaborate on that? What needs to be done, and how is >>>that going to be done? It might be possible to avoid considerable >>>uglification. > > > I'm not altogether sure there. The parsing state machine is all written in > C, and deals with signed chars - I expect we'll need two versions of that > (or one version that's compiled twice using pre-processor macros). Quite > a large job. Suggestions gratefully received. I'm still trying to understand what *needs* to be done - I would move to how this is done only later. What APIs should be extended/changed, and in what way? Regards, Martin From fumanchu at amor.org Wed Jan 5 18:38:52 2005 From: fumanchu at amor.org (Robert Brewer) Date: Wed, 5 Jan 2005 09:38:52 -0800 Subject: [Python-Dev] Re: [Csv] csv module TODO list Message-ID: <3A81C87DC164034AA4E2DDFE11D258E33980EE@exchange.hqamor.amorhq.net> Skip Montanaro wrote: > Andrew> There's a bunch of jobs we (CSV module > maintainers) have been > Andrew> putting off - attached is a list (in no particular order): > > ... > > In addition, it occurred to me this evening that there's > functionality in the csv module I don't think anybody uses. > ... > I'm also not aware that anyone really uses the Sniffer class, > though it does provide some useful functionality should you > need to analyze random CSV files. I used Sniffer quite heavily for my last contract. The client had multiple multigig csv's which needed deduplicating, but they were all from different sources and therefore in different formats. It would have cost me many more hours without the Sniffer. Please keep it. <:) Robert Brewer MIS Amor Ministries fumanchu at amor.org From andrewm at object-craft.com.au Thu Jan 6 02:10:55 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 06 Jan 2005 12:10:55 +1100 Subject: [Csv] Re: [Python-Dev] csv module TODO list In-Reply-To: <41DC637A.5050105@v.loewis.de> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com> <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> <41DC637A.5050105@v.loewis.de> Message-ID: <20050106011055.001163C8E5@coffee.object-craft.com.au> >>>>Can you please elaborate on that? What needs to be done, and how is >>>>that going to be done? It might be possible to avoid considerable >>>>uglification. >> >> I'm not altogether sure there. The parsing state machine is all written in >> C, and deals with signed chars - I expect we'll need two versions of that >> (or one version that's compiled twice using pre-processor macros). Quite >> a large job. Suggestions gratefully received. 
> >I'm still trying to understand what *needs* to be done - I would move to >how this is done only later. What APIs should be extended/changed, and >in what way? That's certainly the first step, and I have to admit that I don't have a clear idea at this time - the unicode issue has been in the "too hard" basket since we started. Marc-Andre Lemburg mentioned that he has encountered UTF-16 encoded csv files, so a reasonable starting point would be the ability to read and parse, as well as the ability to generate, one of these. The reader interface currently returns a row at a time, consuming as many lines from the supplied iterable (with the most common iterable being a file). This suggests to me that we will need an optional "encoding" argument to the reader constructor, and that the reader will need to decode the source lines. That said, I'm hardly a unicode expert, so I may be overlooking something (could a utf-16 encoded character span a line break, for example). The writer interface probably should have similar facilities. However - a number of people have complained about the "iterator" interface, wanting to supply strings (the iterable is necessary because a CSV row can span multiple lines). It's also conceiveable that the source lines could already be unicode objects. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Thu Jan 6 03:03:08 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 06 Jan 2005 13:03:08 +1100 Subject: [Csv] Re: [Python-Dev] csv module TODO list In-Reply-To: <20050106011055.001163C8E5@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com> <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> <41DC637A.5050105@v.loewis.de> <20050106011055.001163C8E5@coffee.object-craft.com.au> Message-ID: <20050106020308.EBE5A3C8E5@coffee.object-craft.com.au> >>I'm still trying to understand what *needs* to be done - I would move to >>how this is done only later. What APIs should be extended/changed, and >>in what way? [...] >The reader interface currently returns a row at a time, consuming as many >lines from the supplied iterable (with the most common iterable being >a file). This suggests to me that we will need an optional "encoding" >argument to the reader constructor, and that the reader will need to >decode the source lines. That said, I'm hardly a unicode expert, so I >may be overlooking something (could a utf-16 encoded character span a >line break, for example). The writer interface probably should have >similar facilities. Ah - I see that the codecs module provides an EncodedFile class - better to use this than add encoding/decoding cruft to the csv module. So, do we duplicate the current reader and writer as UnicodeReader and UnicodeWriter (how else do we know to use the unicode parser)? What about the "dialects"? I guess if a dialect uses no unicode strings, it can be applied to the current parser, but if it does include unicode strings, then the parser would need to raise an exception. The DictReader and DictWriter classes will probably need matching UnicodeDictReader/UnicodeDictWriter versions (use common base class, just specify alternate parser). 
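To make the shape of that concrete, here's a rough sketch of the reader side (illustrative only - the name UnicodeReader, the defaults and the helper are hypothetical, not an agreed API). It decodes the source byte stream from the caller's encoding, re-encodes each line as UTF-8 for the existing byte-oriented parser, and decodes the parsed fields back to unicode objects:

    import csv, codecs

    class UnicodeReader(object):
        """Sketch only: wrap the existing byte-oriented csv.reader so
        that rows come back as lists of unicode objects."""
        def __init__(self, f, dialect='excel', encoding='utf-16', **kwds):
            # decode the raw byte stream into unicode lines
            decoded = codecs.getreader(encoding)(f)
            # re-encode as UTF-8; the delimiters we care about are all
            # single bytes in UTF-8, so the current parser handles them
            utf8_lines = (line.encode('utf-8') for line in decoded)
            self.reader = csv.reader(utf8_lines, dialect=dialect, **kwds)
        def next(self):
            return [unicode(cell, 'utf-8') for cell in self.reader.next()]
        def __iter__(self):
            return self

A UnicodeWriter would do the reverse - encode unicode rows to UTF-8, feed them to the existing writer, and recode the written lines to the target encoding (codecs.EncodedFile could do that last step).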
--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From martin at v.loewis.de Thu Jan 6 17:05:05 2005
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 06 Jan 2005 17:05:05 +0100
Subject: [Csv] Re: [Python-Dev] csv module TODO list
In-Reply-To: <20050106011055.001163C8E5@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com> <20050105093414.00DFF3C8E5@coffee.object-craft.com.au> <41DC637A.5050105@v.loewis.de> <20050106011055.001163C8E5@coffee.object-craft.com.au>
Message-ID: <41DD61B1.1030507@v.loewis.de>

Andrew McNamara wrote:
> Marc-Andre Lemburg mentioned that he has encountered UTF-16 encoded csv
> files, so a reasonable starting point would be the ability to read and
> parse, as well as the ability to generate, one of these.

I see. That would be reasonable, indeed. Notice that this is not so much a "Unicode issue", but more an "encoding" issue. If you solve the "arbitrary encodings" problem, you solve UTF-16 as a side effect.

> The reader interface currently returns a row at a time, consuming as many
> lines from the supplied iterable (with the most common iterable being
> a file). This suggests to me that we will need an optional "encoding"
> argument to the reader constructor, and that the reader will need to
> decode the source lines.

Ok. In this context, I see two possible implementation strategies:

1. Implement the csv module two times: once for bytes, and once for Unicode characters. It is likely that the source code would be the same for each case; you just need to make sure the "Dialect and Formatting Parameters" change their width accordingly. If you use the SRE approach, you would do

   #define CSV_ITEM_T char
   #define CSV_NAME_PREFIX byte_
   #include "csvimpl.c"

   #define CSV_ITEM_T Py_Unicode
   #define CSV_NAME_PREFIX unicode_
   #include "csvimpl.c"

2. Use just the existing _csv module, and represent non-byte encodings as UTF-8. This will work as long as the delimiters and other markup characters always occupy a single byte in UTF-8, which is the case for "':\, as well as for \r and \n. Then, when processing using an explicit encoding, first convert the input into Unicode objects. Then encode the Unicode objects into UTF-8, and pass it to _csv. For the results you get back, convert each element back from UTF-8 to a Unicode object. This could be implemented as

   import codecs, itertools
   import _csv

   def reader(f, encoding=None):
       if encoding is None:
           return _csv.reader(f)
       enc, dec, reader, writer = codecs.lookup(encoding)
       utf8_enc, utf8_dec, utf8_r, utf8_w = codecs.lookup("UTF-8")
       # Make a recoder which can only read
       utf8_stream = codecs.StreamRecoder(f, utf8_enc, None, reader, None)
       csv_reader = _csv.reader(utf8_stream)
       # For performance reasons, map_result could be implemented in C
       def map_result(t):
           result = [None]*len(t)
           for i, val in enumerate(t):
               result[i] = utf8_dec(val)[0]
           return tuple(result)
       return itertools.imap(map_result, csv_reader)
   # This code is untested

This approach has the disadvantage of performing three recodings: from input charset to Unicode, from Unicode to UTF-8, from UTF-8 to Unicode. One could:

- skip the initial recoding if the encoding is already known to be _csv-safe (i.e. if it is a pure ASCII superset). This would be valid for ASCII, iso-8859-n, UTF-8, ...
- offer the user to keep the results in the input encoding, instead of always returning Unicode objects.
Apart from this disadvantage, I think this gives people what they want: they can specify the encoding of the input, and they get the results not only csv-separated, but also unicode-decode. This approach is the same that is used for Python source code encodings: the source is first recoded into UTF-8, then parsed, then recoded back. > That said, I'm hardly a unicode expert, so I > may be overlooking something (could a utf-16 encoded character span a > line break, for example). This cannot happen: \r, in UTF-16, is also 2 bytes (0D 00, if UTF-16LE). There are issues that Unicode has additional line break characters, which is probably irrelevant. Regards, Martin From ajm at flonidan.dk Thu Jan 6 17:22:12 2005 From: ajm at flonidan.dk (Anders J. Munch) Date: Thu, 6 Jan 2005 17:22:12 +0100 Subject: [Csv] Re: [Python-Dev] csv module TODO list Message-ID: <6D9E824FA10BD411BE95000629EE2EC3C6DE3C@FLONIDAN-MAIL> Andrew McNamara wrote: > > I'm not altogether sure there. The parsing state machine is all > written in C, and deals with signed chars - I expect we'll need two > versions of that (or one version that's compiled twice using > pre-processor macros). Quite a large job. Suggestions gratefully > received. How about using UTF-8 internally? Change nothing in _csv.c, but in csv.py encode/decode any unicode strings into UTF-8 on the way to/from _csv. File-like objects passed in by the user can be wrapped in proxies that take care of encoding and decoding user strings, as well as trans-coding between UTF-8 and the users chosen file encoding. All that coding work may slow things down, but your original fast _csv module will still be there when you need it. - Anders From skip at pobox.com Wed Jan 5 21:21:18 2005 From: skip at pobox.com (Skip Montanaro) Date: Wed, 5 Jan 2005 14:21:18 -0600 Subject: [Csv] Re: csv module TODO list In-Reply-To: <20050105121921.GB24030@idi.ntnu.no> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <16859.38960.9935.682429@montanaro.dyndns.org> <20050105075506.314C93C8E5@coffee.object-craft.com.au> <20050105121921.GB24030@idi.ntnu.no> Message-ID: <16860.19518.824788.613286@montanaro.dyndns.org> Magnus> Quite a while ago I posted some material to the csv-list about Magnus> problems using the csv module on Unix-style colon-separated Magnus> files -- it just doesn't deal properly with backslash escaping Magnus> and is quite useless for this kind of file. I seem to recall the Magnus> general view was that it wasn't intended for this kind of thing Magnus> -- only the sort of csv that Microsoft Excel outputs/inputs, Yes, that's my recollection as well. It's possible that we can extend the interpretation of the escape char. Magnus> I'll be happy to re-send or summarize the relevant emails, if Magnus> needed. Yes, that would be helpful. Can you send me an example (three or four lines) of the sort of file it won't grok? Skip From skip at pobox.com Wed Jan 5 20:34:09 2005 From: skip at pobox.com (Skip Montanaro) Date: Wed, 5 Jan 2005 13:34:09 -0600 Subject: [Csv] csv module TODO list In-Reply-To: <20050105110849.CBA843C8E5@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <20050105110849.CBA843C8E5@coffee.object-craft.com.au> Message-ID: <16860.16689.695012.975520@montanaro.dyndns.org> >> * is CSV going to be maintained outside the python tree? >> If not, remove the 2.2 compatibility macros for: PyDoc_STR, >> PyDoc_STRVAR, PyMODINIT_FUNC, etc. 
Andrew> Does anyone thing we should continue to maintain this 2.2 Andrew> compatibility? With the release of 2.4, 2.2 has officially dropped off the radar screen, right (zero probability of a 2.2.n+1 release, though the probability was vanishingly small before). I'd say toss it. Do just that in a single checkin so someone who's interested can do a simple cvs diff to yield an initial patch file for external maintenance of that feature. >> * inline the following functions since they are used only in one >> place get_string, set_string, get_nullchar_as_None, >> set_nullchar_as_None, join_reset (maybe) Andrew> It was done that way as I felt we would be adding more getters Andrew> and setters to the dialect object in future. The only new dialect attribute I envision is an encoding attribute. >> * is it necessary to have Dialect_methods, can you use 0 for tp_methods? Andrew> I was assuming I would need to add methods at some point (in Andrew> fact, I did have methods, but removed them). Dialect objects are really just data containers, right? I don't see that they would need any methods. >> * remove commented out code (PyMem_DEL) on line 261 >> Have you used valgrind on the test to find memory overwrites/leaks? Andrew> No, valgrind wasn't used. I have it here at work. I'll try to find a few minutes to run the csv tests under valgrind's control. Skip From andrewm at object-craft.com.au Fri Jan 7 02:15:33 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 07 Jan 2005 12:15:33 +1100 Subject: [Csv] CSV module brain surgery Message-ID: <20050107011533.EA2183C8E5@coffee.object-craft.com.au> The "dialect" type in the CSV module had been bugging me for a while - it's used to hold the C-type representation of the parser config, and barely exposed to the user (except as an attribute on the reader and writer objects). There were several problems with this internal dialect type - the primary one was that you could write to it's attributes, which meant that cross-attribute validation was doomed. It also reported errors terribly, typically raising something like "invalid type for builtin" and no more information. So, I rewrote it. The result is far more consistent about the types of exceptions it raises, and provides more useful diagnostics to the user (unfortunately, this means minor user visible change, but probably not in any way that they will notice). The dialect type now does it's own validation of options, so these should better reflect what the parser is capable of (downside is that Skip's python validator reports more than one error per exception, the new version can only raise one). Previously, the conversion from Python types to C types was done in the setter (property) functions, and the type init function called setattr to put it's arguments onto the type, hence the wonky reporting of type errors. The new code makes the type's attributes read-only - they are set directly from the init function, which makes cross-attribute validation viable. Note that the dialect type constructor takes either class instance (and looks on it for the appropriate attributes), and/or keyword arguments. This makes it more complicated that I like, but means you can say stuff like "excel, but tab delimited": csv.reader(file, 'excel', delimiter='\t'). I'm about ready to commit this (and some minor changes to the tests). Comments, please? _csv.c | 423 +++++++-----------!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 1 files changed, 51 insertions(+), 72 deletions(-), 300 modifications(!) 
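For reference, these are the sorts of calls the reworked dialect handling is meant to accept - a usage sketch only, not part of the patch (the file names are made up):

    import csv

    # a registered dialect name, as before
    rows = csv.reader(open("input.csv", "rb"), "excel")

    # a dialect class (uninstantiated), as before
    rows = csv.reader(open("input.csv", "rb"), csv.excel)

    # "excel, but tab delimited": a dialect plus keyword overrides
    rows = csv.reader(open("input.tsv", "rb"), "excel", delimiter="\t")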
Index: Modules/_csv.c =================================================================== RCS file: /cvsroot/python/python/dist/src/Modules/_csv.c,v retrieving revision 1.16 diff -u -r1.16 _csv.c --- Modules/_csv.c 6 Jan 2005 02:25:41 -0000 1.16 +++ Modules/_csv.c 6 Jan 2005 12:39:50 -0000 @@ -73,7 +73,7 @@ char escapechar; /* escape character */ int skipinitialspace; /* ignore spaces following delimiter? */ PyObject *lineterminator; /* string to write between records */ - QuoteStyle quoting; /* style of quoting to write */ + int quoting; /* style of quoting to write */ int strict; /* raise exception on bad CSV */ } DialectObj; @@ -130,17 +130,6 @@ return dialect_obj; } -static int -check_delattr(PyObject *v) -{ - if (v == NULL) { - PyErr_SetString(PyExc_TypeError, - "Cannot delete attribute"); - return -1; - } - return 0; -} - static PyObject * get_string(PyObject *str) { @@ -148,25 +137,6 @@ return str; } -static int -set_string(PyObject **str, PyObject *v) -{ - if (check_delattr(v) < 0) - return -1; - if (!PyString_Check(v) -#ifdef Py_USING_UNICODE -&& !PyUnicode_Check(v) -#endif -) { - PyErr_BadArgument(); - return -1; - } - Py_XDECREF(*str); - Py_INCREF(v); - *str = v; - return 0; -} - static PyObject * get_nullchar_as_None(char c) { @@ -178,48 +148,22 @@ return PyString_FromStringAndSize((char*)&c, 1); } -static int -set_None_as_nullchar(char * addr, PyObject *v) -{ - if (check_delattr(v) < 0) - return -1; - if (v == Py_None) - *addr = '\0'; - else if (!PyString_Check(v) || PyString_Size(v) != 1) { - PyErr_BadArgument(); - return -1; - } - else { - char *s = PyString_AsString(v); - if (s == NULL) - return -1; - *addr = s[0]; - } - return 0; -} - static PyObject * Dialect_get_lineterminator(DialectObj *self) { return get_string(self->lineterminator); } -static int -Dialect_set_lineterminator(DialectObj *self, PyObject *value) -{ - return set_string(&self->lineterminator, value); -} - static PyObject * Dialect_get_escapechar(DialectObj *self) { return get_nullchar_as_None(self->escapechar); } -static int -Dialect_set_escapechar(DialectObj *self, PyObject *value) +static PyObject * +Dialect_get_quotechar(DialectObj *self) { - return set_None_as_nullchar(&self->escapechar, value); + return get_nullchar_as_None(self->quotechar); } static PyObject * @@ -229,51 +173,109 @@ } static int -Dialect_set_quoting(DialectObj *self, PyObject *v) +_set_bool(const char *name, int *target, PyObject *src, int dflt) +{ + if (src == NULL) + *target = dflt; + else + *target = PyObject_IsTrue(src); + return 0; +} + +static int +_set_int(const char *name, int *target, PyObject *src, int dflt) +{ + if (src == NULL) + *target = dflt; + else { + if (!PyInt_Check(src)) { + PyErr_Format(PyExc_TypeError, + "\"%s\" must be an integer", name); + return -1; + } + *target = PyInt_AsLong(src); + } + return 0; +} + +static int +_set_char(const char *name, char *target, PyObject *src, char dflt) +{ + if (src == NULL) + *target = dflt; + else { + if (src == Py_None) + *target = '\0'; + else if (!PyString_Check(src) || PyString_Size(src) != 1) { + PyErr_Format(PyExc_TypeError, + "\"%s\" must be an 1-character string", + name); + return -1; + } + else { + char *s = PyString_AsString(src); + if (s == NULL) + return -1; + *target = s[0]; + } + } + return 0; +} + +static int +_set_str(const char *name, PyObject **target, PyObject *src, const char *dflt) +{ + if (src == NULL) + *target = PyString_FromString(dflt); + else { + if (src == Py_None) + *target = NULL; + else if (!PyString_Check(src) +#ifdef Py_USING_UNICODE + && 
!PyUnicode_Check(src) +#endif + ) { + PyErr_Format(PyExc_TypeError, + "\"%s\" must be an string", name); + return -1; + } else { + Py_XDECREF(*target); + Py_INCREF(src); + *target = src; + } + } + return 0; +} + +static int +dialect_check_quoting(int quoting) { - int quoting; StyleDesc *qs = quote_styles; - if (check_delattr(v) < 0) - return -1; - if (!PyInt_Check(v)) { - PyErr_BadArgument(); - return -1; - } - quoting = PyInt_AsLong(v); for (qs = quote_styles; qs->name; qs++) { - if (qs->style == quoting) { - self->quoting = quoting; + if (qs->style == quoting) return 0; - } } - PyErr_BadArgument(); + PyErr_Format(PyExc_TypeError, "bad \"quoting\" value"); return -1; } -static struct PyMethodDef Dialect_methods[] = { - { NULL, NULL } -}; - #define D_OFF(x) offsetof(DialectObj, x) static struct PyMemberDef Dialect_memberlist[] = { - { "quotechar", T_CHAR, D_OFF(quotechar) }, - { "delimiter", T_CHAR, D_OFF(delimiter) }, - { "skipinitialspace", T_INT, D_OFF(skipinitialspace) }, - { "doublequote", T_INT, D_OFF(doublequote) }, - { "strict", T_INT, D_OFF(strict) }, + { "delimiter", T_CHAR, D_OFF(delimiter), READONLY }, + { "skipinitialspace", T_INT, D_OFF(skipinitialspace), READONLY }, + { "doublequote", T_INT, D_OFF(doublequote), READONLY }, + { "strict", T_INT, D_OFF(strict), READONLY }, { NULL } }; static PyGetSetDef Dialect_getsetlist[] = { - { "escapechar", (getter)Dialect_get_escapechar, - (setter)Dialect_set_escapechar }, - { "lineterminator", (getter)Dialect_get_lineterminator, - (setter)Dialect_set_lineterminator }, - { "quoting", (getter)Dialect_get_quoting, - (setter)Dialect_set_quoting }, - {NULL}, + { "escapechar", (getter)Dialect_get_escapechar}, + { "lineterminator", (getter)Dialect_get_lineterminator}, + { "quotechar", (getter)Dialect_get_quotechar}, + { "quoting", (getter)Dialect_get_quoting}, + {NULL}, }; static void @@ -283,107 +285,158 @@ self->ob_type->tp_free((PyObject *)self); } +/* + * Return a new reference to a dialect instance + * + * If given a string, looks up the name in our dialect registry + * If given a class, instantiate (which runs python validity checks) + * If given an instance, return a new reference to the instance + */ +static PyObject * +dialect_instantiate(PyObject *dialect) +{ + Py_INCREF(dialect); + /* If dialect is a string, look it up in our registry */ + if (PyString_Check(dialect) +#ifdef Py_USING_UNICODE + || PyUnicode_Check(dialect) +#endif + ) { + PyObject * new_dia; + new_dia = get_dialect_from_registry(dialect); + Py_DECREF(dialect); + return new_dia; + } + /* A class rather than an instance? 
Instantiate */ + if (PyObject_TypeCheck(dialect, &PyClass_Type)) { + PyObject * new_dia; + new_dia = PyObject_CallFunction(dialect, ""); + Py_DECREF(dialect); + return new_dia; + } + /* Make sure we finally have an instance */ + if (!PyInstance_Check(dialect)) { + PyErr_SetString(PyExc_TypeError, "dialect must be an instance"); + Py_DECREF(dialect); + return NULL; + } + return dialect; +} + +static char *dialect_kws[] = { + "dialect", + "delimiter", + "doublequote", + "escapechar", + "lineterminator", + "quotechar", + "quoting", + "skipinitialspace", + "strict", + NULL +}; + static int dialect_init(DialectObj * self, PyObject * args, PyObject * kwargs) { - PyObject *dialect = NULL, *name_obj, *value_obj; - - self->quotechar = '"'; - self->delimiter = ','; - self->escapechar = '\0'; - self->skipinitialspace = 0; - Py_XDECREF(self->lineterminator); - self->lineterminator = PyString_FromString("\r\n"); - if (self->lineterminator == NULL) + int ret = -1; + PyObject *dialect = NULL; + PyObject *delimiter = NULL; + PyObject *doublequote = NULL; + PyObject *escapechar = NULL; + PyObject *lineterminator = NULL; + PyObject *quotechar = NULL; + PyObject *quoting = NULL; + PyObject *skipinitialspace = NULL; + PyObject *strict = NULL; + + if (!PyArg_ParseTupleAndKeywords(args, kwargs, + "|OOOOOOOOO", dialect_kws, + &dialect, + &delimiter, + &doublequote, + &escapechar, + &lineterminator, + "echar, + "ing, + &skipinitialspace, + &strict)) return -1; - self->quoting = QUOTE_MINIMAL; - self->doublequote = 1; - self->strict = 0; - if (!PyArg_UnpackTuple(args, "", 0, 1, &dialect)) - return -1; - Py_XINCREF(dialect); - if (kwargs != NULL) { - PyObject * key = PyString_FromString("dialect"); - PyObject * d; - - d = PyDict_GetItem(kwargs, key); - if (d) { - Py_INCREF(d); - Py_XDECREF(dialect); - PyDict_DelItem(kwargs, key); - dialect = d; - } - Py_DECREF(key); - } - if (dialect != NULL) { - int i; - PyObject * dir_list; + Py_XINCREF(delimiter); + Py_XINCREF(doublequote); + Py_XINCREF(escapechar); + Py_XINCREF(lineterminator); + Py_XINCREF(quotechar); + Py_XINCREF(quoting); + Py_XINCREF(skipinitialspace); + Py_XINCREF(strict); + if (dialect != NULL) { + dialect = dialect_instantiate(dialect); + if (dialect == NULL) + goto err; +#define DIALECT_GETATTR(v, n) \ + if (v == NULL) \ + v = PyObject_GetAttrString(dialect, n) + + DIALECT_GETATTR(delimiter, "delimiter"); + DIALECT_GETATTR(doublequote, "doublequote"); + DIALECT_GETATTR(escapechar, "escapechar"); + DIALECT_GETATTR(lineterminator, "lineterminator"); + DIALECT_GETATTR(quotechar, "quotechar"); + DIALECT_GETATTR(quoting, "quoting"); + DIALECT_GETATTR(skipinitialspace, "skipinitialspace"); + DIALECT_GETATTR(strict, "strict"); + PyErr_Clear(); + Py_DECREF(dialect); + } - /* If dialect is a string, look it up in our registry */ - if (PyString_Check(dialect) -#ifdef Py_USING_UNICODE - || PyUnicode_Check(dialect) -#endif - ) { - PyObject * new_dia; - new_dia = get_dialect_from_registry(dialect); - Py_DECREF(dialect); - if (new_dia == NULL) - return -1; - dialect = new_dia; - } - /* A class rather than an instance? 
Instantiate */ - if (PyObject_TypeCheck(dialect, &PyClass_Type)) { - PyObject * new_dia; - new_dia = PyObject_CallFunction(dialect, ""); - Py_DECREF(dialect); - if (new_dia == NULL) - return -1; - dialect = new_dia; - } - /* Make sure we finally have an instance */ - if (!PyInstance_Check(dialect) || - (dir_list = PyObject_Dir(dialect)) == NULL) { - PyErr_SetString(PyExc_TypeError, - "dialect must be an instance"); - Py_DECREF(dialect); - return -1; - } - /* And extract the attributes */ - for (i = 0; i < PyList_GET_SIZE(dir_list); ++i) { - char *s; - name_obj = PyList_GET_ITEM(dir_list, i); - s = PyString_AsString(name_obj); - if (s == NULL) - return -1; - if (s[0] == '_') - continue; - value_obj = PyObject_GetAttr(dialect, name_obj); - if (value_obj) { - if (PyObject_SetAttr((PyObject *)self, - name_obj, value_obj)) { - Py_DECREF(value_obj); - Py_DECREF(dir_list); - Py_DECREF(dialect); - return -1; - } - Py_DECREF(value_obj); - } - } - Py_DECREF(dir_list); - Py_DECREF(dialect); - } - if (kwargs != NULL) { - int pos = 0; + /* check types and convert to C values */ +#define DIASET(meth, name, target, src, dflt) \ + if (meth(name, target, src, dflt)) \ + goto err + DIASET(_set_char, "delimiter", &self->delimiter, delimiter, ','); + DIASET(_set_bool, "doublequote", &self->doublequote, doublequote, 1); + DIASET(_set_char, "escapechar", &self->escapechar, escapechar, 0); + DIASET(_set_str, "lineterminator", &self->lineterminator, lineterminator, "\r\n"); + DIASET(_set_char, "quotechar", &self->quotechar, quotechar, '"'); + DIASET(_set_int, "quoting", &self->quoting, quoting, QUOTE_MINIMAL); + DIASET(_set_bool, "skipinitialspace", &self->skipinitialspace, skipinitialspace, 0); + DIASET(_set_bool, "strict", &self->strict, strict, 0); + + /* sanity check options */ + if (dialect_check_quoting(self->quoting)) + goto err; + if (self->delimiter == 0) { + PyErr_SetString(PyExc_TypeError, "delimiter must be set"); + goto err; + } + if (self->quoting != QUOTE_NONE && self->quotechar == 0) { + PyErr_SetString(PyExc_TypeError, + "quotechar must be set if quoting enabled"); + goto err; + } + if (self->lineterminator == 0) { + PyErr_SetString(PyExc_TypeError, "lineterminator must be set"); + goto err; + } + if (self->quoting == QUOTE_NONE && self->escapechar == 0) { + PyErr_SetString(PyExc_TypeError, + "escapechar must be set if quoting disabled"); + goto err; + } - while (PyDict_Next(kwargs, &pos, &name_obj, &value_obj)) { - if (PyObject_SetAttr((PyObject *)self, - name_obj, value_obj)) - return -1; - } - } - return 0; + ret = 0; +err: + Py_XDECREF(delimiter); + Py_XDECREF(doublequote); + Py_XDECREF(escapechar); + Py_XDECREF(lineterminator); + Py_XDECREF(quotechar); + Py_XDECREF(quoting); + Py_XDECREF(skipinitialspace); + Py_XDECREF(strict); + return ret; } static PyObject * @@ -433,7 +486,7 @@ 0, /* tp_weaklistoffset */ 0, /* tp_iter */ 0, /* tp_iternext */ - Dialect_methods, /* tp_methods */ + 0, /* tp_methods */ Dialect_memberlist, /* tp_members */ Dialect_getsetlist, /* tp_getset */ 0, /* tp_base */ @@ -1332,7 +1385,7 @@ return NULL; } Py_INCREF(dialect_obj); - /* A class rather than an instance? Instanciate */ + /* A class rather than an instance? 
Instantiate */ if (PyObject_TypeCheck(dialect_obj, &PyClass_Type)) { PyObject * new_dia; new_dia = PyObject_CallFunction(dialect_obj, ""); -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Fri Jan 7 03:22:22 2005 From: skip at pobox.com (Skip Montanaro) Date: Thu, 6 Jan 2005 20:22:22 -0600 Subject: [Csv] CSV module brain surgery In-Reply-To: <20050107011533.EA2183C8E5@coffee.object-craft.com.au> References: <20050107011533.EA2183C8E5@coffee.object-craft.com.au> Message-ID: <16861.62046.244101.873686@montanaro.dyndns.org> Andrew> The "dialect" type in the CSV module had been bugging me for a Andrew> while - it's used to hold the C-type representation of the Andrew> parser config, and barely exposed to the user (except as an Andrew> attribute on the reader and writer objects). Andrew> There were several problems with this internal dialect type - Andrew> the primary one was that you could write to its attributes, Andrew> which meant that cross-attribute validation was doomed. It also Andrew> reported errors terribly, typically raising something like Andrew> "invalid type for builtin" and no more information. Andrew> So, I rewrote it. .... Andrew> I'm about ready to commit this (and some minor changes to the Andrew> tests). Comments, please? As long as I can still pass a dialect class into the constructor and have it interpreted properly, I don't really care what else happens. ;-) Skip From andrewm at object-craft.com.au Fri Jan 7 04:08:33 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 07 Jan 2005 14:08:33 +1100 Subject: [Csv] CSV module brain surgery In-Reply-To: <16861.62046.244101.873686@montanaro.dyndns.org> References: <20050107011533.EA2183C8E5@coffee.object-craft.com.au> <16861.62046.244101.873686@montanaro.dyndns.org> Message-ID: <20050107030833.709DB3C8E5@coffee.object-craft.com.au> > Andrew> The "dialect" type in the CSV module had been bugging me for a > Andrew> while - it's used to hold the C-type representation of the > Andrew> parser config, and barely exposed to the user (except as an > Andrew> attribute on the reader and writer objects). > > Andrew> There were several problems with this internal dialect type - > Andrew> the primary one was that you could write to its attributes, > Andrew> which meant that cross-attribute validation was doomed. It also > Andrew> reported errors terribly, typically raising something like > Andrew> "invalid type for builtin" and no more information. > > Andrew> So, I rewrote it. > >As long as I can still pass a dialect class into the constructor and have it >interpreted properly, I don't really care what else happens. ;-) Yes, obviously the published interface should remain the same, although the validation done by the Dialect base class is no longer needed (the underlying dialect type does its own validation). BTW, I've managed to fix several of the issues raised by: http://www.google.com.au/groups?selm=vsb89q1d3n5qb1%40corp.supernews.com The tricky bit is assuring myself that I haven't introduced any regressions in the process.
8-) -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Fri Jan 7 07:13:22 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 07 Jan 2005 17:13:22 +1100 Subject: [Csv] csv module TODO list In-Reply-To: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> Message-ID: <20050107061322.A6A563C8E5@coffee.object-craft.com.au> >There's a bunch of jobs we (CSV module maintainers) have been putting >off - attached is a list (in no particular order): [...] >Also, review comments from Jeremy Hylton, 10 Apr 2003: > > I've been reviewing extension modules looking for C types that should > participate in garbage collection. I think the csv ReaderObj and > WriterObj should participate. The ReaderObj it contains a reference to > input_iter that could be an arbitrary Python object. The iterator > object could well participate in a cycle that refers to the ReaderObj. > The WriterObj has a reference to a writeline callable, which could well > be a method of an object that also points to the WriterObj. I finally got around to looking at this, only to realise Jeremy did the work back in Apr 2003 (thanks). One question, however - the GC doco in the Python/C API seems to suggest to me that PyObject_GC_Track should be called on the newly minted object prior to returning from the initialiser (and correspondingly PyObject_GC_UnTrack should be called prior to dismantling). This isn't being done in the module as it stands. Is the module wrong, or is my understanding of the reference manual incorrect? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Fri Jan 7 08:54:54 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 07 Jan 2005 18:54:54 +1100 Subject: [Csv] Minor change to behaviour of csv module Message-ID: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au> I'm considering a change to the csv module that could potentially break some obscure uses of the module (but CSV files usually quote, rather than escape, so the most common uses aren't affected). Currently, with a non-default escapechar='\\', input like: field one,field \ two,field three Returns: ["field one", "field \\\ntwo", "field three"] In the 2.5 series, I propose changing this to return: ["field one", "field \ntwo", "field three"] Is this reasonable? Is the old behaviour desirable in any way (we could add a switch to enable the new behaviour, but I feel that would only allow the confusion to continue)? BTW, some of my other changes have changed the exceptions raised when bad arguments were passed to the reader and writer factory functions - previously, the exceptions were semi-random, including TypeError, AttributeError and csv.Error - they should now almost always be TypeError (like most other argument passing errors). I can't see this being a problem, but I'm prepared to listen to arguments.
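For concreteness, here is one way the example above can be exercised through the reader API (a sketch only -- escapechar is the real dialect parameter, but the two commented results are simply the current and proposed behaviours described in this message, not guarantees from the snippet itself):

    import csv

    # Feed the example above to the reader as an iterable of lines.
    lines = ["field one,field \\\n", "two,field three\n"]
    for row in csv.reader(lines, escapechar='\\'):
        print row
    # 2.3/2.4 (current):  ['field one', 'field \\\ntwo', 'field three']
    # proposed for 2.5:   ['field one', 'field \ntwo', 'field three']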
-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Fri Jan 7 13:06:23 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 07 Jan 2005 23:06:23 +1100 Subject: [Csv] Re: [Python-Dev] Minor change to behaviour of csv module In-Reply-To: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au> References: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au> Message-ID: <20050107120623.EC0673C8E5@coffee.object-craft.com.au> >I'm considering a change to the csv module that could potentially break >some obscure uses of the module (but CSV files usually quote, rather >than escape, so the most common uses aren't effected). > >Currently, with a non-default escapechar='\\', input like: > > field one,field \ > two,field three > >Returns: > > ["field one", "field \\\ntwo", "field three"] > >In the 2.5 series, I propose changing this to return: > > ["field one", "field \ntwo", "field three"] > >Is this reasonable? Is the old behaviour desirable in any way (we could >add a switch to enable to new behaviour, but I feel that would only >allow the confusion to continue)? Thinking about this further, I suspect we have to retain the current behaviour, as broken as it is, as the default: it's conceivable that someone somewhere is post-processing the result to remove the backslashes, and if we fix the csv module, we'll break their code. Note that PEP-305 had nothing to say about escaping, nor does the module reference manual. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From magnus at hetland.org Fri Jan 7 14:38:17 2005 From: magnus at hetland.org (Magnus Lie Hetland) Date: Fri, 7 Jan 2005 14:38:17 +0100 Subject: [Csv] Minor change to behaviour of csv module In-Reply-To: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au> References: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au> Message-ID: <20050107133817.GB5503@idi.ntnu.no> Andrew McNamara : > [snip] > Currently, with a non-default escapechar='\\', input like: > > field one,field \ > two,field three > > Returns: > > ["field one", "field \\\ntwo", "field three"] > > In the 2.5 series, I propose changing this to return: > > ["field one", "field \ntwo", "field three"] IMO this is the *only* reasonable behaviour. I don't understand why the escape character should be left in; this is one of the reason why UNIX-style colon-separated values don't work with the current module. If one wanted the first version, one would (I presume) write field one,field \\\ two,field three -- Magnus Lie Hetland Fallen flower I see / Returning to its branch http://hetland.org Ah! a butterfly. [Arakida Moritake] From mcherm at mcherm.com Fri Jan 7 14:45:20 2005 From: mcherm at mcherm.com (Michael Chermside) Date: Fri, 7 Jan 2005 05:45:20 -0800 Subject: [Python-Dev] Re: [Csv] Minor change to behaviour of csv module Message-ID: <1105105520.41de927049442@mcherm.com> Andrew explains that in the CSV module, escape characters are not properly removed. Magnus writes: > IMO this is the *only* reasonable behaviour. I don't understand why > the escape character should be left in; this is one of the reason why > UNIX-style colon-separated values don't work with the current module. 
Andrew writes back later: > Thinking about this further, I suspect we have to retain the current > behaviour, as broken as it is, as the default: it's conceivable that > someone somewhere is post-processing the result to remove the backslashes, > and if we fix the csv module, we'll break their code. I'm with Magnus on this. No one has 4 year old code using the CSV module. The existing behavior is just simply WRONG. Sure, of course we should try to maintain backward compatibility, but surely SOME cases don't require it, right? Can't we treat this misbehavior as an outright bug? -- Michael Chermside From tim.peters at gmail.com Fri Jan 7 17:00:42 2005 From: tim.peters at gmail.com (Tim Peters) Date: Fri, 7 Jan 2005 11:00:42 -0500 Subject: [Python-Dev] Re: [Csv] csv module TODO list In-Reply-To: <20050107061322.A6A563C8E5@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <20050107061322.A6A563C8E5@coffee.object-craft.com.au> Message-ID: <1f7befae05010708005275e23d@mail.gmail.com> [Andrew McNamara] >> Also, review comments from Jeremy Hylton, 10 Apr 2003: >> >> I've been reviewing extension modules looking for C types that should >> participate in garbage collection. I think the csv ReaderObj and >> WriterObj should participate. The ReaderObj it contains a reference to >> input_iter that could be an arbitrary Python object. The iterator >> object could well participate in a cycle that refers to the ReaderObj. >> The WriterObj has a reference to a writeline callable, which could well >> be a method of an object that also points to the WriterObj. > I finally got around to looking at this, only to realise Jeremy did the > work back in Apr 2003 (thanks). One question, however - the GC doco in > the Python/C API seems to suggest to me that PyObject_GC_Track should be > called on the newly minted object prior to returning from the initialiser > (and correspondingly PyObject_GC_UnTrack should be called prior to > dismantling). This isn't being done in the module as it stands. Is the > module wrong, or is my understanding of the reference manual incorrect? The purpose of "tracking" and "untracking" is to let cyclic gc know when it (respectively) is and isn't safe to call an object's tp_traverse method. Primarily, when an object is first created at the C level, it may contain NULLs or heap trash in pointer slots, and then the object's tp_traverse could segfault if it were called while the object remained in an insane (wrt tp_traverse) state. Similarly, cleanup actions in the tp_dealloc may make a tp_traverse-sane object tp_traverse-insane, so tp_dealloc should untrack the object before that occurs. If tracking is never done, then the object effectively never participates in cyclic gc: its tp_traverse will never get called, and it will effectively act as an external root (keeping itself and everything reachable from it alive). So, yes, track it during construction, but not before all the members referenced by its tp_traverse are in a sane state. Putting the track call "at the end" of the constructor is usually best practice. tp_dealloc should untrack it then. In a debug build, that will assert-fail if the object hasn't actually been tracked. 
PyObject_GC_Del will untrack it for you (if it's still tracked), but it's risky to rely on that -- it's too easy to forget that Py_DECREFs on contained objects can end up executing arbitrary Python code (via __del__ and weakref callbacks, and via allowing other threads to run), which can in turn trigger a round of cyclic gc *while* your tp_dealloc is still running. So it's safest to untrack the object very early in tp_dealloc. I doubt this happens in the csv module, but an untrack/track pair should also be put around any block of method code that temporarily puts the object into a tp_traverse-insane state and that contains any C API calls that may end up triggering cyclic gc. That's very rare. From skip at pobox.com Fri Jan 7 17:09:13 2005 From: skip at pobox.com (Skip Montanaro) Date: Fri, 7 Jan 2005 10:09:13 -0600 Subject: [Csv] Minor change to behaviour of csv module In-Reply-To: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au> References: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au> Message-ID: <16862.46121.778915.968964@montanaro.dyndns.org> Andrew> I'm considering a change to the csv module that could Andrew> potentially break some obscure uses of the module (but CSV files Andrew> usually quote, rather than escape, so the most common uses Andrew> aren't effected). I'm with the other respondents. This looks like a bug that should be squashed. Skip From skip at pobox.com Sat Jan 8 06:03:07 2005 From: skip at pobox.com (Skip Montanaro) Date: Fri, 7 Jan 2005 23:03:07 -0600 Subject: [Csv] valgrind output Message-ID: <16863.27019.162437.881182@montanaro.dyndns.org> I compiled Python in an up-to-date cvs sandbox and ran ./python ../Lib/test/regrtest.py test_csv under control of "valgrind --tool=memcheck" with the default valgrind suppression file that comes with the Python distribution. I've attached the output. If you search for "csv" you'll see where the "test_csv" line is emitted and where valgrind finds suspicious memory activity during the test. I'm not much of a valgrind person, having only used it once or twice, so I didn't bother at this stage to dig into the output. If there's more I can do, let me know and I'll make some more runs. Skip -------------- next part -------------- A non-text attachment was scrubbed... Name: csvtest.log Type: application/octet-stream Size: 57154 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20050107/64914c09/attachment.obj From andrewm at object-craft.com.au Mon Jan 10 01:40:06 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 10 Jan 2005 11:40:06 +1100 Subject: [Python-Dev] Re: [Csv] Minor change to behaviour of csv module In-Reply-To: <48F57F83-60B3-11D9-ADA4-000A95EFAE9E@aleax.it> References: <1105105520.41de927049442@mcherm.com> <48F57F83-60B3-11D9-ADA4-000A95EFAE9E@aleax.it> Message-ID: <20050110004006.88CB63C8E5@coffee.object-craft.com.au> >> Andrew explains that in the CSV module, escape characters are not >> properly removed. >> >> Magnus writes: >>> IMO this is the *only* reasonable behaviour. I don't understand why >>> the escape character should be left in; this is one of the reason why >>> UNIX-style colon-separated values don't work with the current module. >> >> Andrew writes back later: >>> Thinking about this further, I suspect we have to retain the current >>> behaviour, as broken as it is, as the default: it's conceivable that >>> someone somewhere is post-processing the result to remove the >>> backslashes, >>> and if we fix the csv module, we'll break their code. 
>> >> I'm with Magnus on this. No one has 4 year old code using the CSV >> module. >> The existing behavior is just simply WRONG. Sure, of course we should >> try to maintain backward compatibility, but surely SOME cases don't >> require it, right? Can't we treat this misbehavior as an outright bug? > >+1 -- the nonremoval of escape characters smells like a bug to me, too. Okay, I'm glad the community agrees (less work, less crustification). For what it's worth, it wasn't a bug so much as a misfeature. I was explicitly adding the escape character back in. The intention was to make the feature more forgiving on users who accidentally set the escape character - in other words, only special (quoting, escaping, field delimiter) characters received special treatment. With the benefit of hindsight, that was an inadequately considered choice. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Mon Jan 10 04:41:09 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 10 Jan 2005 14:41:09 +1100 Subject: [Csv] valgrind output In-Reply-To: <16863.27019.162437.881182@montanaro.dyndns.org> References: <16863.27019.162437.881182@montanaro.dyndns.org> Message-ID: <20050110034109.423273C889@coffee.object-craft.com.au> >I compiled Python in an up-to-date cvs sandbox and ran > > ./python ../Lib/test/regrtest.py test_csv > >under control of "valgrind --tool=memcheck" with the default valgrind >suppression file that comes with the Python distribution. I've attached the >output. If you search for "csv" you'll see where the "test_csv" line is >emitted and where valgrind finds suspicious memory activity during the >test. Did you do the other things mentioned in Misc/README.valgrind (uncomment Py_USING_MEMORY_DEBUGGER, uncomment PyObject_Free and PyObject_Realloc suppressions)? When I do the things it suggests, and use the python suppression file, I get no errors. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Mon Jan 10 05:44:41 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 10 Jan 2005 15:44:41 +1100 Subject: [Csv] csv module and universal newlines In-Reply-To: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> Message-ID: <20050110044441.250103C889@coffee.object-craft.com.au> This item, from the TODO list, has been bugging me for a while: >* Reader and universal newlines don't interact well, reader doesn't > honour Dialect's lineterminator setting. All outstanding bug id's > (789519, 944890, 967934 and 1072404) are related to this - it's > a difficult problem and further discussion is needed. The csv parser consumes lines from an iterator, but it also has its own idea of end-of-line conventions, which are currently only used by the writer, not the reader, which is a source of much confusion. The writer, by default, also attempts to emit a \r\n sequence, which results in more confusion unless the file is opened in binary mode. I'm looking for suggestions for how we can mitigate these problems (without breaking things for existing users). The standard file iterator includes the end-of-line characters in the returned string. One potential solution is, then, to ignore the line chunking done by the file iterator, and logically concatenate the source lines until the csv parser's idea of lineterminator is seen - but this negates the benefits of using an iterator.
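A rough sketch of that re-chunking idea, purely for illustration (it ignores quoting entirely, which is exactly the hard part, and if the terminator never appears it buffers the whole input, which is why it negates the iterator's benefits):

    def resplit(line_iter, lineterminator="\r\n"):
        # Ignore the iterator's own line breaks and re-split the
        # concatenated text on the dialect's lineterminator instead.
        buf = ""
        for chunk in line_iter:
            buf += chunk
            while lineterminator in buf:
                record, buf = buf.split(lineterminator, 1)
                yield record
        if buf:
            yield buf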
Another option might be to provide a new interface that relies on a file-like object being supplied. The lineterminator character would only be used with this interface, with the current interface falling back to using only \n. Rather a drastic solution. Any other ideas? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From neal at metaslash.com Tue Jan 11 00:31:26 2005 From: neal at metaslash.com (Neal Norwitz) Date: Mon, 10 Jan 2005 18:31:26 -0500 Subject: [Csv] csv module TODO list In-Reply-To: <20050105110849.CBA843C8E5@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <20050105110849.CBA843C8E5@coffee.object-craft.com.au> Message-ID: <20050110233126.GA14363@janus.swcomplete.com> On Wed, Jan 05, 2005 at 10:08:49PM +1100, Andrew McNamara wrote: > >Also, review comments from Neal Norwitz, 22 Mar 2003 (some of these should > >already have been addressed): > > I should apologise to Neal here for not replying to him at the time. Hey, I'm impressed you got to them. :-) I completely forgot about it. > >* rather than use PyErr_BadArgument, should you use assert? > > (first example, Dialect_set_quoting, line 218) > > You mean C assert()? I don't think I'm really following you here - > where would the type of the object be checked in a way the user could > recover from? IIRC, I meant C assert(). This goes back to a discussion a long time ago about what is the preferred way to handle invalid arguments. I doubt it's important to change. > >* I think you need PyErr_NoMemory() before returning on line 768, 1178 > > The examples I looked at in the Python core didn't do this - are you sure? > (now lines 832 and 1280). Originally, they were a plain PyObject_NEW(). Now they are a PyObject_GC_New() so it seems no further change is necessary. > >* is PyString_AsString(self->dialect->lineterminator) on line 994 > > guaranteed not to return NULL? If not, it could crash by > > passing to memmove. > >* PyString_AsString() can return NULL on line 1048 and 1063, > > the result is passed to join_append() > > Looking at the PyString_AsString implementation, it looks safe (we ensure > it's really a string elsewhere)? Ok. Then it should be fine. I spot checked lineterminator and it looked ok. > >* iteratable should be iterable? (line 1088) > > Sorry, I don't know what you're getting at here? (now line 1162). Heh, I had to read that twice myself. It was a typo (assuming I wasn't completely wrong)--an extra "at", but it doesn't exist any longer. I don't think there are any changes remaining to be done from my original code review. BTW, I always try to run valgrind before a release, especially major releases. Neal From skip at pobox.com Wed Jan 12 02:59:22 2005 From: skip at pobox.com (Skip Montanaro) Date: Tue, 11 Jan 2005 19:59:22 -0600 Subject: [Csv] csv module and universal newlines In-Reply-To: <20050110044441.250103C889@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <20050110044441.250103C889@coffee.object-craft.com.au> Message-ID: <16868.33914.837771.954739@montanaro.dyndns.org> Andrew> The csv parser consumes lines from an iterator, but it also has Andrew> it's own idea of end-of-line conventions, which are currently Andrew> only used by the writer, not the reader, which is a source of Andrew> much confusion. The writer, by default, also attempts to emit a Andrew> \r\n sequence, which results in more confusion unless the file Andrew> is opened in binary mode. 
Andrew> I'm looking for suggestions for how we can mitigate these Andrew> problems (without breaking things for existing users). You can argue that reading csv data from/writing csv data to a file on Windows if the file isn't opened in binary mode is an error. Perhaps we should enforce that in situations where it matters. Would this be a start? terminators = {"darwin": "\r", "win32": "\r\n"} if (dialect.lineterminator != terminators.get(sys.platform, "\n") and "b" not in getattr(f, "mode", "b")): raise IOError, ("%s not opened in binary mode" % getattr(f, "name", "???")) The elements of the postulated terminators dictionary may already exist somewhere within the sys or os modules (if not, perhaps they should be added). The idea of the check is to enforce binary mode on those objects that support a mode if the desired line terminator doesn't match the platform's line terminator. Skip From andrewm at object-craft.com.au Wed Jan 12 23:55:25 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 13 Jan 2005 09:55:25 +1100 Subject: [Python-Dev] Re: [Csv] csv module and universal newlines In-Reply-To: <16868.33914.837771.954739@montanaro.dyndns.org> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <20050110044441.250103C889@coffee.object-craft.com.au> <16868.33914.837771.954739@montanaro.dyndns.org> Message-ID: <20050112225525.236BE3C889@coffee.object-craft.com.au> >You can argue that reading csv data from/writing csv data to a file on >Windows if the file isn't opened in binary mode is an error. Perhaps we >should enforce that in situations where it matters. Would this be a start? > > terminators = {"darwin": "\r", > "win32": "\r\n"} > > if (dialect.lineterminator != terminators.get(sys.platform, "\n") and > "b" not in getattr(f, "mode", "b")): > raise IOError, ("%s not opened in binary mode" % > getattr(f, "name", "???")) > >The elements of the postulated terminators dictionary may already exist >somewhere within the sys or os modules (if not, perhaps they should be >added). The idea of the check is to enforce binary mode on those objects >that support a mode if the desired line terminator doesn't match the >platform's line terminator. Where that falls down, I think, is where you want to read an alien file - in fact, under unix, most of the CSV files I read use \r\n for end-of-line. Also, I *really* don't like the idea of looking for a mode attribute on the supplied iterator - it feels like a layering violation. We've advertised the fact that it's an iterator, so we shouldn't be using anything but the iterator protocol. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From Jack.Jansen at cwi.nl Thu Jan 13 00:02:39 2005 From: Jack.Jansen at cwi.nl (Jack Jansen) Date: Thu, 13 Jan 2005 00:02:39 +0100 Subject: [Python-Dev] Re: [Csv] csv module and universal newlines In-Reply-To: <16868.33914.837771.954739@montanaro.dyndns.org> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <20050110044441.250103C889@coffee.object-craft.com.au> <16868.33914.837771.954739@montanaro.dyndns.org> Message-ID: <0E6093F4-64EE-11D9-B7C6-000D934FF6B4@cwi.nl> On 12-jan-05, at 2:59, Skip Montanaro wrote: > terminators = {"darwin": "\r", > "win32": "\r\n"} > > if (dialect.lineterminator != terminators.get(sys.platform, "\n") > and > "b" not in getattr(f, "mode", "b")): > raise IOError, ("%s not opened in binary mode" % > getattr(f, "name", "???")) On MacOSX you really want universal newlines. 
CSV files produced by older software (such as AppleWorks) will have \r line terminators, but lots of other programs will have files with normal \n terminators. -- Jack Jansen, , http://www.cwi.nl/~jack If I can't dance I don't want to be part of your revolution -- Emma Goldman From skip at pobox.com Thu Jan 13 03:36:54 2005 From: skip at pobox.com (Skip Montanaro) Date: Wed, 12 Jan 2005 20:36:54 -0600 Subject: [Python-Dev] Re: [Csv] csv module and universal newlines In-Reply-To: <20050112225525.236BE3C889@coffee.object-craft.com.au> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <20050110044441.250103C889@coffee.object-craft.com.au> <16868.33914.837771.954739@montanaro.dyndns.org> <20050112225525.236BE3C889@coffee.object-craft.com.au> Message-ID: <16869.57030.306263.612202@montanaro.dyndns.org> >> The idea of the check is to enforce binary mode on those objects that >> support a mode if the desired line terminator doesn't match the >> platform's line terminator. Andrew> Where that falls down, I think, is where you want to read an Andrew> alien file - in fact, under unix, most of the CSV files I read Andrew> use \r\n for end-of-line. Well, you can either require 'b' in that situation or "know" that 'b' isn't needed on Unix systems. Andrew> Also, I *really* don't like the idea of looking for a mode Andrew> attribute on the supplied iterator - it feels like a layering Andrew> violation. We've advertised the fact that it's an iterator, so Andrew> we shouldn't be using anything but the iterator protocol. The fundamental problem is that the iterator protocol on files is designed for use only with text mode (or universal newline mode, but that's just as much of a problem in this context). I think you either have to abandon the iterator protocol or peek under the iterator's covers to make sure it reads and writes in binary mode. Right now, people on windows create writers like this writer = csv.writer(open("somefile", "w")) and are confused when their csv files contain blank lines. I think the reader and writer objects have to at least emit a warning when they discover a source or destination that violates the requirements. Skip From skip at pobox.com Thu Jan 13 03:39:41 2005 From: skip at pobox.com (Skip Montanaro) Date: Wed, 12 Jan 2005 20:39:41 -0600 Subject: [Python-Dev] Re: [Csv] csv module and universal newlines In-Reply-To: <0E6093F4-64EE-11D9-B7C6-000D934FF6B4@cwi.nl> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <20050110044441.250103C889@coffee.object-craft.com.au> <16868.33914.837771.954739@montanaro.dyndns.org> <0E6093F4-64EE-11D9-B7C6-000D934FF6B4@cwi.nl> Message-ID: <16869.57197.95323.656027@montanaro.dyndns.org> Jack> On MacOSX you really want universal newlines. CSV files produced Jack> by older software (such as AppleWorks) will have \r line Jack> terminators, but lots of other programs will have files with Jack> normal \n terminators. Won't work. You have to be able to write a Windows csv file on any platform. Binary mode is the only way to get that. 
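For what it's worth, the pattern that avoids the blank-line symptom is simply to open the file in binary mode before handing it to the writer -- a minimal sketch (the filename is made up):

    import csv

    # 'wb' keeps the platform's newline translation out of the way;
    # the writer emits the dialect's '\r\n' terminator itself.
    writer = csv.writer(open("somefile.csv", "wb"))
    writer.writerow(["a", "b", "c"])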
Skip From bob at redivi.com Thu Jan 13 03:56:05 2005 From: bob at redivi.com (Bob Ippolito) Date: Wed, 12 Jan 2005 21:56:05 -0500 Subject: [Python-Dev] Re: [Csv] csv module and universal newlines In-Reply-To: <16869.57197.95323.656027@montanaro.dyndns.org> References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <20050110044441.250103C889@coffee.object-craft.com.au> <16868.33914.837771.954739@montanaro.dyndns.org> <0E6093F4-64EE-11D9-B7C6-000D934FF6B4@cwi.nl> <16869.57197.95323.656027@montanaro.dyndns.org> Message-ID: On Jan 12, 2005, at 21:39, Skip Montanaro wrote: > Jack> On MacOSX you really want universal newlines. CSV files > produced > Jack> by older software (such as AppleWorks) will have \r line > Jack> terminators, but lots of other programs will have files with > Jack> normal \n terminators. > > Won't work. You have to be able to write a Windows csv file on any > platform. Binary mode is the only way to get that. Isn't universal newlines only used for reading? I have had no problems using the csv module for reading files with universal newlines by opening the file myself or providing an iterator. Unicode, on the other hand, I have had problems with. -bob From andrewm at object-craft.com.au Thu Jan 13 04:21:41 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 13 Jan 2005 14:21:41 +1100 Subject: [Python-Dev] Re: [Csv] csv module and universal newlines In-Reply-To: References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> <20050110044441.250103C889@coffee.object-craft.com.au> <16868.33914.837771.954739@montanaro.dyndns.org> <0E6093F4-64EE-11D9-B7C6-000D934FF6B4@cwi.nl> <16869.57197.95323.656027@montanaro.dyndns.org> Message-ID: <20050113032141.78EB13C889@coffee.object-craft.com.au> >Isn't universal newlines only used for reading? That's right. And the CSV reader has its own version of universal newlines anyway (from the py1.5 days). >I have had no problems using the csv module for reading files with >universal newlines by opening the file myself or providing an iterator. Neither have I, funnily enough. >Unicode, on the other hand, I have had problems with. Ah, so somebody does want it then? Good to hear. Hard to get motivated to make radical changes without feedback. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Thu Jan 13 04:49:05 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 13 Jan 2005 14:49:05 +1100 Subject: [Csv] Something's fishy w/ Mac line endings... In-Reply-To: <16192.61853.960831.703844@montanaro.dyndns.org> References: <16189.38378.352326.481821@montanaro.dyndns.org> <20030818010256.31A8E3CA49@coffee.object-craft.com.au> <16192.16871.968296.398935@montanaro.dyndns.org> <20030818042033.A219F3CA49@coffee.object-craft.com.au> <16192.61853.960831.703844@montanaro.dyndns.org> Message-ID: <20050113034905.18AF43C889@coffee.object-craft.com.au> Just going back through old mail, and I came across this from last time we considered this issue: On Mon, 18 Aug 2003, Skip Montanaro wrote: >Unfortunately, I think the correct fix is to not require a NUL following >every \r or \n character encountered. I think that places the ball in your >court for the moment. Can you evaluate how hard that would be? This would actually result in us losing data, unfortunately (the data between the \r and the "end-of-string" \0 is part of the file).
What's happening is that the file iterator on the mac is not recognising \r as end-of-line, and it's presumably returning the whole file as one line. I could make the csv parser treat \r as end-of-line and continue processing the string, but papering over it in the CSV module is only going to lead to worse problems (what happens if someone tries to read a 2GB file?) - better the user knows they've made an error earlier rather than later. The problem is that the error message doesn't obviously lead one to the cause. I suspect the only answer is to add a caveats, or usage section to the reference manual. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Thu Jan 20 22:04:13 2005 From: skip at pobox.com (Skip Montanaro) Date: Thu, 20 Jan 2005 15:04:13 -0600 Subject: [Csv] csv module generating an invalid line? Message-ID: <16880.7373.8765.516395@montanaro.dyndns.org> We use the csv module in the SpamBayes project as an interchange format (*). It's generating, in part, a file like this: ... simplymaya,0,1 entitled,1,1 "subject: ",0,1 depression.,1,0 ... Note the CR inside the quoted field (third line). When I try to read that file, blammo! This example consists of a junk.csv file with just the above four lines: >>> for row in csv.reader(open("junk.csv")): ... print row ... ['simplymaya', '0', '1'] ['entitled', '1', '1'] Traceback (most recent call last): File "", line 1, in ? Error: newline inside string I think one way or the other the csv module is broken. Either it should be able to read this csv file or it should somehow generate it differently. I've confirmed this with Python from CVS (as of Jan 5 05), the 2.4 maintenance branch (as of Dec 26 04) and Python 2.3.4. Thoughts? Skip * See the sb_dbexpimp.py script: http://cvs.sourceforge.net/viewcvs.py/spambayes/spambayes/scripts/sb_dbexpimp.py?rev=1.17&view=log The above has this import: try: import csv # might get the old object craft csv module - has no reader attr if not hasattr(csv, "reader"): raise ImportError except ImportError: import spambayes.compatcsv as csv Note that I am getting the Python-sourced csv file, not the compatibility module that's part of the SpamBayes code. From skip at pobox.com Thu Jan 20 23:13:05 2005 From: skip at pobox.com (Skip Montanaro) Date: Thu, 20 Jan 2005 16:13:05 -0600 Subject: [Csv] csv module generating an invalid line? In-Reply-To: <16880.7373.8765.516395@montanaro.dyndns.org> References: <16880.7373.8765.516395@montanaro.dyndns.org> Message-ID: <16880.11505.110836.163654@montanaro.dyndns.org> Skip> ... Skip> simplymaya,0,1 Skip> entitled,1,1 Skip> "subject: Skip> ",0,1 Skip> depression.,1,0 Skip> ... Ack... Attached to prevent email corruption... Skip -------------- next part -------------- A non-text attachment was scrubbed... Name: junk.csv Type: application/octet-stream Size: 73 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20050120/6269cb94/attachment.obj From sjmachin at lexicon.net Fri Jan 21 02:17:47 2005 From: sjmachin at lexicon.net (sjmachin at lexicon.net) Date: Fri, 21 Jan 2005 12:17:47 +1100 Subject: [Csv] csv module generating an invalid line? In-Reply-To: <16880.7373.8765.516395@montanaro.dyndns.org> Message-ID: <41F0F2EB.23584.13AC6A1@localhost> On 20 Jan 2005 at 15:04, Skip Montanaro wrote: > > We use the csv module in the SpamBayes project as an interchange > format (*). It's generating, in part, a file like this: > > ... 
> simplymaya,0,1 > entitled,1,1 > "subject: > ",0,1 > depression.,1,0 > ... > > Note the CR inside the quoted field (third line). When I try to read > that file, blammo! This example consists of a junk.csv file with just > the above four lines: > > >>> for row in csv.reader(open("junk.csv")): > ... print row > ... > ['simplymaya', '0', '1'] > ['entitled', '1', '1'] > Traceback (most recent call last): > File "", line 1, in ? > Error: newline inside string > > I think one way or the other the csv module is broken. Either it > should be able to read this csv file or it should somehow generate it > differently. > > I've confirmed this with Python from CVS (as of Jan 5 05), the 2.4 > maintenance branch (as of Dec 26 04) and Python 2.3.4. > > Thoughts? > > Skip >>> file('junk.csv', 'rb').read() 'simplymaya,0,1\r\nentitled,1,1\r\n"subject: \r",0,1\r\ndepression.,1,0\r\n' Your junk.csv appears to be a valid csv file. The field containing the embedded \r is quoted properly. It's the _reader_ that's broken. Doubly so: (1) chucking an exception (2) calling \r a "newline". As you say, it's broken in 2.3 as well: Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32 >>> import csv >>> r = csv.reader(file('junk.csv','rb')) >>> contents = list(r) Traceback (most recent call last): File "", line 1, in ? _csv.Error: newline inside string >>> Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32 >>> import csv >>> list(csv.reader(file('junk.csv', 'rb'))) Traceback (most recent call last): File "", line 1, in ? _csv.Error: newline inside string >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/csv/attachments/20050121/f0e4199a/attachment.html From skip at pobox.com Fri Jan 21 03:31:59 2005 From: skip at pobox.com (Skip Montanaro) Date: Thu, 20 Jan 2005 20:31:59 -0600 Subject: [Csv] Test case Message-ID: <16880.27039.553666.583956@montanaro.dyndns.org> Here's a test script for the problem I described earlier: #!/usr/bin/env python import csv import os row = ["sub\rject:","0","1"] writer = csv.writer(open("tmp.csv", "wb")) writer.writerow(row) del writer reader = csv.reader(open("tmp.csv", "rb")) for row in reader: try: print row except csv.Error, msg: print msg del reader os.remove("tmp.csv") I cvs up'd my Python source and confirmed that the problem is fixed there. It's a problem in 2.3 and 2.4 though. Any chance this can be fixed in time for 2.3.5? Skip From sjmachin at lexicon.net Sat Jan 22 00:06:57 2005 From: sjmachin at lexicon.net (sjmachin at lexicon.net) Date: Sat, 22 Jan 2005 10:06:57 +1100 Subject: [Csv] bugs in parsing csv? Message-ID: <41F225C1.10004.C6B5A1@localhost> I came across this example in the online version of "Programming in Lua" by Roberto Ieru.+y: >>> weird = '"hello "" hello", "",""\r\n' This is not IMHO a correctly formed CSV string. It would not be produced by csv.writer. However csv.reader accepts it without complaint: >>> import csv >>> rdr = csv.reader([weird]) >>> weird2 = rdr.next() >>> weird2 ['hello " hello', ' ""', ''] >>> wtr = csv.writer(file('weird2.csv', 'wb')) >>> wtr.writerow(weird2) >>> del wtr >>> file('weird2.csv', 'rb').read() '"hello "" hello"," """"",\r\n' # correctly quoted. Here are some more examples: >>> csv.reader([' "\r\n']).next() [' "'] >>> csv.reader([' ""\r\n']).next() [' ""'] >>> csv.reader(['x ""\r\n']).next() ['x ""'] >>> csv.reader(['x "\r\n']).next() ['x "'] Looks like we don't give a damn if the field doesn't start with a quote. 
In the real world this result might be OK for a field like 'Pat O"Brien' but it does indicate that the data source is probably _NOT_ quoting at all. However a not-infrequent mistake made by people generating what they call csv files is to wrap quotes around some/all fields without doubling any pre-existing quotes: >>> csv.reader(['"Pat O"Brien"\r\n']).next() ['Pat OBrien"'] <<<<<<<<<<<============== aarrbejaysus!!! Further examples of where the data source needs head alignment and csv.reader doesn't complain, giving an unfortunate result: >>> csv.reader(['spot",the",mistake"\r\n']).next() ['spot"', 'the"', 'mistake"'] >>> csv.reader(['"attempt", "at", "pretty", "formatting"\r\n']).next() ['attempt', ' "at"', ' "pretty"', ' "formatting"'] From skip at pobox.com Sat Jan 22 21:32:06 2005 From: skip at pobox.com (Skip Montanaro) Date: Sat, 22 Jan 2005 14:32:06 -0600 Subject: [Csv] List/email migration coming up for mail services host by mojam.com Message-ID: <16882.47174.580103.928772@montanaro.dyndns.org> Folks, Mojam.com has a new email server. This note is a heads up to let everybody know that I plan to migrate all email services (including mailing lists) hosted on mail.mojam.com (aka manatee.mojam.com) in the next week or two. My current preferred date is Saturday, January 29th. If that presents a problem for anyone, let me know. At that time I will make the following changes: * All POP mailboxes will be moved to the new machine * Any mailing lists of the form @manatee.mojam.com will be converted to @mail.mojam.com. Since I'm migrating from a rather old Mandrake Linux machine running Sendmail as its MTA to a new Fedora Core 2 machine running Postfix as its MTA, I expect email hosting/forwarding and mailing lists to be unavailable for a good part of the day. I will send a message out when I start the migration and another message once I've finished (hopefully just to tell you that all changes were successful). Chris D, I may need to get some phone time with you to discuss how this will affect any non-Mojam domains you are responsible for. If you have any questions, feel free to drop me a note (skip at pobox.com) or give me a call (847-971-7098), especially if your need is urgent and I've failed to respond to an email in a timely (< 1 day) fashion. -- Skip Montanaro skip at mojam.com http://www.mojam.com/ From andrewm at object-craft.com.au Mon Jan 24 00:00:28 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 24 Jan 2005 10:00:28 +1100 Subject: [Csv] Test case In-Reply-To: <16880.27039.553666.583956@montanaro.dyndns.org> References: <16880.27039.553666.583956@montanaro.dyndns.org> Message-ID: <20050123230028.807C63C889@coffee.object-craft.com.au> >I cvs up'd my Python source and confirmed that the problem is fixed there. >It's a problem in 2.3 and 2.4 though. Any chance this can be fixed in time >for 2.3.5? The fix involved some radical surgery, so I doubt it's appropriate for 2.3.5 - sorry. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Mon Jan 24 13:09:09 2005 From: skip at pobox.com (Skip Montanaro) Date: Mon, 24 Jan 2005 06:09:09 -0600 Subject: [Csv] Test case In-Reply-To: <20050123230028.807C63C889@coffee.object-craft.com.au> References: <16880.27039.553666.583956@montanaro.dyndns.org> <20050123230028.807C63C889@coffee.object-craft.com.au> Message-ID: <16884.58725.780104.976776@montanaro.dyndns.org> >> I cvs up'd my Python source and confirmed that the problem is fixed >> there. 
It's a problem in 2.3 and 2.4 though. Any chance this can be >> fixed in time for 2.3.5? Andrew> The fix involved some radical surgery, so I doubt it's Andrew> appropriate for 2.3.5 - sorry. Bummer. Okay, we have a workaround in SpamBayes, and it is a pretty rare corner case for that app. Since we've bumped into this before do you think it warrants a note in the docs? Skip From andrewm at object-craft.com.au Mon Jan 24 14:17:23 2005 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Tue, 25 Jan 2005 00:17:23 +1100 Subject: [Csv] Test case In-Reply-To: <16884.58725.780104.976776@montanaro.dyndns.org> References: <16880.27039.553666.583956@montanaro.dyndns.org> <20050123230028.807C63C889@coffee.object-craft.com.au> <16884.58725.780104.976776@montanaro.dyndns.org> Message-ID: <20050124131723.E63F73C889@coffee.object-craft.com.au> > >> I cvs up'd my Python source and confirmed that the problem is fixed > >> there. It's a problem in 2.3 and 2.4 though. Any chance this can be > >> fixed in time for 2.3.5? > > Andrew> The fix involved some radical surgery, so I doubt it's > Andrew> appropriate for 2.3.5 - sorry. > >Bummer. For reference, the parser was partially doing EOL processing in the line iterator code, partially in the state machine. This meant the EOL processing had no idea whether it was in a quoted field or not. In 2.5, I moved all the EOL processing into the state machine. >Okay, we have a workaround in SpamBayes, and it is a pretty rare >corner case for that app. Since we've bumped into this before do you think >it warrants a note in the docs? Possibly. There's a bunch of other stuff of a similar nature that could do with documenting. I'm inclined to think of it as a bug - while documenting bugs is a nice thing, there doesn't seem to be much of a precedent for it in the reference manual.. 8-) -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Sat Jan 29 23:09:02 2005 From: skip at pobox.com (Skip Montanaro) Date: Sat, 29 Jan 2005 16:09:02 -0600 Subject: [Csv] testing 1 2 3 ... Message-ID: <16892.2430.53152.587960@montanaro.dyndns.org> This is a test message from skip to see if the new email server/mailman is processing mail well (or at all). Please disregard. Skip From skip at pobox.com Sun Jan 30 00:00:23 2005 From: skip at pobox.com (Skip Montanaro) Date: Sat, 29 Jan 2005 17:00:23 -0600 Subject: [Csv] List/email migration complete Message-ID: <16892.5511.804630.221374@montanaro.dyndns.org> I believe the new Mojam.com mail server is up and running. If you have saved addresses of the form somewhere at manatee.mojam.com please change them to somewhere at mojam.com or somewhere at mail.mojam.com Manatee will continue to forward email for awhile, so there's no immediate urgency. Still, I would like to shut off mail server on that machine in the next couple weeks, so tend to your housekeeping now. If you notice that any of the Mojam.comm mailing lists or email addresses you normally use seem to be a black hole, or if you have any other questions about the Mojam.com mail server, drop me a note directly (skip at pobox.com). -- Skip Montanaro skip at mojam.com http://www.mojam.com/