From skip at pobox.com  Sat Aug 16 04:24:42 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 15 Aug 2003 21:24:42 -0500
Subject: [Csv] Something's fishy w/ Mac line endings...
Message-ID: <16189.38378.352326.481821@montanaro.dyndns.org>

Folks,

Here's a bug reported against the csv module:

    http://python.org/sf/789519

There seems to be a problem with what it expects to see after the \r
character.  It wants to see either a NUL or a \n followed by a NUL.  In this
case, it sees the '0' which starts the next line.

I've assigned it to myself for now and I'll try to take a look at it over
the weekend, but Andrew or Dave are welcome to investigate.

Skip

From andrewm at object-craft.com.au  Mon Aug 18 03:02:56 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 18 Aug 2003 11:02:56 +1000
Subject: [Csv] Something's fishy w/ Mac line endings...
In-Reply-To: Message from Skip Montanaro <16189.38378.352326.481821@montanaro.dyndns.org>
References: <16189.38378.352326.481821@montanaro.dyndns.org>
Message-ID: <20030818010256.31A8E3CA49@coffee.object-craft.com.au>

>Here's a bug reported against the csv module:
>
>    http://python.org/sf/789519
>
>There seems to be a problem with what it expects to see after the \r
>character.  It wants to see either a NUL or a \n followed by a NUL.  In this
>case, it sees the '0' which starts the next line.

I wonder if it's a unicode issue?

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Mon Aug 18 05:03:03 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 17 Aug 2003 22:03:03 -0500
Subject: [Csv] Something's fishy w/ Mac line endings...
In-Reply-To: <20030818010256.31A8E3CA49@coffee.object-craft.com.au>
References: <16189.38378.352326.481821@montanaro.dyndns.org>
	<20030818010256.31A8E3CA49@coffee.object-craft.com.au>
Message-ID: <16192.16871.968296.398935@montanaro.dyndns.org>

    >> There seems to be a problem with what it expects to see after the \r
    >> character.
    >> It wants to see either a NUL or a \n followed by a NUL.
    >> In this case, it sees the '0' which starts the next line.

    Andrew> I wonder if it's a unicode issue?

Shouldn't be.  The test case the submitter posted only uses ASCII.

Looking at the problem a bit, I see this call chain:

    Reader_iternext ->
      PyIter_Next ->
        file_iternext ->
          readahead_get_line_skip

readahead_get_line_skip notes the presence of \n and NUL terminates the line
it returns, but not the presence of \r.

I see one of two possible solutions:

    1. See if readahead_get_line_skip should special-case \r when not
       followed by \n.  I think this may be the "most correct" approach.

    2. Change Reader_iternext to not rely on a NUL following the putative
       end-of-line character.

Skip

From skip at pobox.com  Mon Aug 18 05:35:52 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 17 Aug 2003 22:35:52 -0500
Subject: [Csv] Something's fishy w/ Mac line endings...
Message-ID: <16192.18840.483109.460137@montanaro.dyndns.org>

I wrote:

    Looking at the problem a bit, I see this call chain:

        Reader_iternext ->
          PyIter_Next ->
            file_iternext ->
              readahead_get_line_skip

On second thought, I think the problem may be that we're calling PyIter_Next
at all.  That's probably only supposed to work if the file is opened in text
mode.  Since we expect files to be opened in binary mode, Reader_iternext
should probably be doing its own EOL detection based upon the setting of the
lineterminator.  That's a lot of extra labor, but may be the correct
solution.

Skip

From andrewm at object-craft.com.au  Mon Aug 18 06:20:33 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 18 Aug 2003 14:20:33 +1000
Subject: [Csv] Something's fishy w/ Mac line endings...
In-Reply-To: Message from Skip Montanaro <16192.16871.968296.398935@montanaro.dyndns.org>
References: <16189.38378.352326.481821@montanaro.dyndns.org>
	<20030818010256.31A8E3CA49@coffee.object-craft.com.au>
	<16192.16871.968296.398935@montanaro.dyndns.org>
Message-ID: <20030818042033.A219F3CA49@coffee.object-craft.com.au>

>    Andrew> I wonder if it's a unicode issue?
>
>Shouldn't be.  The test case the submitter posted only uses ASCII.

However, OS-X deals with unicode natively - the standard terminal window
interprets UTF-8 correctly, and, presumably, can also generate it as input
to a character mode application...

>readahead_get_line_skip notes the presence of \n and NUL terminates the line
>it returns, but not the presence of \r.
>
>I see one of two possible solutions:
>
>    1. See if readahead_get_line_skip should special-case \r when not
>       followed by \n.  I think this may be the "most correct" approach.

I can't remember - is the EOL character a property of the Reader?

We need to do a more comprehensive update for Unicode (while making the
string handling 8 bit clean), but the most expedient fix is appropriate for
Python 2.3.1.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Mon Aug 18 08:35:12 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 18 Aug 2003 16:35:12 +1000
Subject: [Csv] Something's fishy w/ Mac line endings...
In-Reply-To: Message from Skip Montanaro <16192.18840.483109.460137@montanaro.dyndns.org>
References: <16192.18840.483109.460137@montanaro.dyndns.org>
Message-ID: <20030818063512.C7FAB3CA49@coffee.object-craft.com.au>

>I wrote:
>
>    Looking at the problem a bit, I see this call chain:
>
>        Reader_iternext ->
>          PyIter_Next ->
>            file_iternext ->
>              readahead_get_line_skip
>
>On second thought, I think the problem may be that we're calling PyIter_Next
>at all.  That's probably only supposed to work if the file is opened in text
>mode.
>Since we expect files to be opened in binary mode, Reader_iternext
>should probably be doing its own EOL detection based upon the setting of the
>lineterminator.  That's a lot of extra labor, but may be the correct
>solution.

I think the intention was that by using PyIter_Next, we'd get the advantage
of the universal EOL support in 2.3 - in which case, maybe we should drop
our own EOL detection...

I wonder if the user's problems go away when they open their file in text
mode?

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Mon Aug 18 17:27:41 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 18 Aug 2003 10:27:41 -0500
Subject: [Csv] Something's fishy w/ Mac line endings...
In-Reply-To: <20030818063512.C7FAB3CA49@coffee.object-craft.com.au>
References: <16192.18840.483109.460137@montanaro.dyndns.org>
	<20030818063512.C7FAB3CA49@coffee.object-craft.com.au>
Message-ID: <16192.61549.621429.454836@montanaro.dyndns.org>

    Andrew> I think the intention was that by using PyIter_Next, we'd get
    Andrew> the advantage of the universal EOL support in 2.3 - in which
    Andrew> case, maybe we should drop our own EOL detection...

I think we would sacrifice 2.2 compatibility and the ability to set any eol
besides \n, \r\n or \r.

    Andrew> I wonder if the user's problems go away when they open their
    Andrew> file in text mode?

The author's test did open the files in text mode.  I added the 'b' to make
the test conform to our current expectations.

How hard would it be for you to modify _csv to not require a NUL after the
putative EOL character?

Skip

From skip at pobox.com  Mon Aug 18 17:32:45 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 18 Aug 2003 10:32:45 -0500
Subject: [Csv] Something's fishy w/ Mac line endings...
In-Reply-To: <20030818042033.A219F3CA49@coffee.object-craft.com.au>
References: <16189.38378.352326.481821@montanaro.dyndns.org>
	<20030818010256.31A8E3CA49@coffee.object-craft.com.au>
	<16192.16871.968296.398935@montanaro.dyndns.org>
	<20030818042033.A219F3CA49@coffee.object-craft.com.au>
Message-ID: <16192.61853.960831.703844@montanaro.dyndns.org>

    Andrew> I can't remember - is the EOL character a property of the Reader?

It's a property of the dialect object.  Currently, I don't think we restrict
the lineterminator attribute, so it would probably be valid for it to be
":", \b or '47'.

    Andrew> We need to do a more comprehensive update for Unicode (while
    Andrew> making the string handling 8 bit clean), but the most expedient
    Andrew> fix is appropriate for Python 2.3.1.

Unfortunately, I think the correct fix is to not require a NUL following
every \r or \n character encountered.  I think that places the ball in your
court for the moment.  Can you evaluate how hard that would be?  I note that
ReaderObj does contain a dialect field, so you do have access to the
lineterminator while reading the file.

Skip

From andrewm at object-craft.com.au  Tue Aug 19 04:15:03 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Tue, 19 Aug 2003 12:15:03 +1000
Subject: [Csv] Something's fishy w/ Mac line endings...
In-Reply-To: Message from Skip Montanaro <16192.61549.621429.454836@montanaro.dyndns.org>
References: <16192.18840.483109.460137@montanaro.dyndns.org>
	<20030818063512.C7FAB3CA49@coffee.object-craft.com.au>
	<16192.61549.621429.454836@montanaro.dyndns.org>
Message-ID: <20030819021503.7BFE83CA49@coffee.object-craft.com.au>

>    Andrew> I think the intention was that by using PyIter_Next, we'd get
>    Andrew> the advantage of the universal EOL support in 2.3 - in which
>    Andrew> case, maybe we should drop our own EOL detection...
>
>I think we would sacrifice 2.2 compatibility and the ability to set any eol
>besides \n, \r\n or \r.
I still think it's the right thing to do: there should only be one line
splitting implementation in Python.  If the user has conventions that don't
match, they're a) not dealing with a csv file, and b) can provide their own
line iterator (which is a more general solution anyway).

And as there is no separate distribution of the new csv module, the 2.2
compatibility is pretty moot (you'd have to download 2.3 and extract the
module yourself).

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Tue Aug 19 05:31:05 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 18 Aug 2003 22:31:05 -0500
Subject: [Csv] Something's fishy w/ Mac line endings...
In-Reply-To: <20030819021503.7BFE83CA49@coffee.object-craft.com.au>
References: <16192.18840.483109.460137@montanaro.dyndns.org>
	<20030818063512.C7FAB3CA49@coffee.object-craft.com.au>
	<16192.61549.621429.454836@montanaro.dyndns.org>
	<20030819021503.7BFE83CA49@coffee.object-craft.com.au>
Message-ID: <16193.39417.902830.721032@montanaro.dyndns.org>

    Andrew> And as there is no separate distribution of the new csv module,
    Andrew> the 2.2 compatibility is pretty moot (you'd have to download 2.3
    Andrew> and extract the module yourself).

One of the arguments for making new modules work with the previous minor
release is that they get adopted faster.  If people are stuck on 2.2.x for
some reason, they can still parse csv files and either not have to wait for
2.3 or change the way they do that when 2.3 is released.

There's also the problem that 2.3.1 is supposed to be a bugfix release.
Even though the csv module has only been around a short time and we aren't
likely to break much, if any, code, changing the semantics needs to be
considered carefully.  The assumption here is that to fix the bug properly
we have to change the module's semantics.

Also, what about writing?  If a user says they want Mac line endings, we
have to guarantee that, right?
That means for writing we still require files be opened as 'wb', not 'wU',
otherwise \r would get translated into the platform's actual EOL sequence.

Skip

From andrewm at object-craft.com.au  Tue Aug 19 05:56:50 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Tue, 19 Aug 2003 13:56:50 +1000
Subject: [Csv] Something's fishy w/ Mac line endings...
In-Reply-To: Message from Skip Montanaro <16193.39417.902830.721032@montanaro.dyndns.org>
References: <16192.18840.483109.460137@montanaro.dyndns.org>
	<20030818063512.C7FAB3CA49@coffee.object-craft.com.au>
	<16192.61549.621429.454836@montanaro.dyndns.org>
	<20030819021503.7BFE83CA49@coffee.object-craft.com.au>
	<16193.39417.902830.721032@montanaro.dyndns.org>
Message-ID: <20030819035650.DB9C83CA4A@coffee.object-craft.com.au>

>    Andrew> And as there is no separate distribution of the new csv module,
>    Andrew> the 2.2 compatibility is pretty moot (you'd have to download 2.3
>    Andrew> and extract the module yourself).
>
>One of the arguments for making new modules work with the previous minor
>release is that they get adopted faster.  If people are stuck on 2.2.x for
>some reason, they can still parse csv files and either not have to wait for
>2.3 or change the way they do that when 2.3 is released.
>
>There's also the problem that 2.3.1 is supposed to be a bugfix release.
>Even though the csv module has only been around a short time and we aren't
>likely to break much, if any, code, changing the semantics needs to be
>considered carefully.  The assumption here is that to fix the bug properly
>we have to change the module's semantics.
>
>Also, what about writing?  If a user says they want Mac line endings, we
>have to guarantee that, right?  That means for writing we still require
>files be opened as 'wb', not 'wU', otherwise \r would get translated into
>the platform's actual EOL sequence.
The problem is that our end of line processing is incompatible with the
use of an iterator as the source of input lines - there is no satisfactory
answer that allows us to retain both.

The requirement that the input file be opened in binary mode for what
is obviously a text format is going to be a never-ending source of surprise
for people using the module, and seems like a bigger wart than the one
we're now facing.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From sjmachin at lexicon.net  Thu Aug 21 00:59:24 2003
From: sjmachin at lexicon.net (sjmachin at lexicon.net)
Date: Thu, 21 Aug 2003 08:59:24 +1000
Subject: [Csv] Something's fishy w/ Mac line endings...
In-Reply-To: <20030819035650.DB9C83CA4A@coffee.object-craft.com.au>
References: Message from Skip Montanaro <16193.39417.902830.721032@montanaro.dyndns.org>
Message-ID: <3F4489EC.31161.BC1B6A@localhost>

On 19 Aug 2003 at 13:56, Andrew McNamara wrote:

> The problem is that our end of line processing is incompatible with the
> use of an iterator as the source of input lines - there is no satisfactory
> answer that allows us to retain both.

Using an iterator as a source of what?  Lines, you say?  The documentation
says it "iterates over lines" [what does that mean?] and that the iterator
should return "strings", without saying what they should contain, how they
should be terminated, etc.  See examples with commentary below.

>
> The requirement that the input file be opened in binary mode for what
> is obviously a text format is going to be a never-ending source of surprise
> for people using the module, and seems like a bigger wart than the one
> we're now facing.
>

I agree on the surprise factor with binary mode.  It's not obvious what the
purpose is.  How does Excel on the Mac terminate lines in CSV files?  CR or
CRLF?
>>> alist= ['aaa,bbb,ccc', 'ddd,eee', 'fff']
>>> [x for x in csv.reader(alist)]
[['aaa', 'bbb', 'ccc'], ['ddd', 'eee'], ['fff']]
# so we don't need line terminators

>>> blist= ['aaa,bbb,ccc\n', 'ddd,eee\n', 'fff\n']
>>> [x for x in csv.reader(blist)]
[['aaa', 'bbb', 'ccc'], ['ddd', 'eee'], ['fff']]
# but if they are supplied, they are ignored

>>> clist= ['aaa,bbb\nccc\n', 'ddd,eee\n', 'fff\n']
>>> [x for x in csv.reader(clist)]
Traceback (most recent call last):
  File "", line 1, in ?
_csv.Error: newline inside string
# except when embedded in an unquoted string/line

>>> dlist= ['aaa,"bbb\nccc",qqq\n', 'ddd,eee\n', 'fff\n']
>>> [x for x in csv.reader(dlist)]
Traceback (most recent call last):
  File "", line 1, in ?
_csv.Error: newline inside string
# whoops, we really do have to pretend we are reading a file in *TEXT* mode
# (see next example)

>>> elist= ['aaa,"bbb\n', 'ccc",qqq\n', 'ddd,eee\n', 'fff\n']
>>> [x for x in csv.reader(elist)]
[['aaa', 'bbb\nccc', 'qqq'], ['ddd', 'eee'], ['fff']]
# Wow, how do we explain all that to J. Random Newbie?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/csv/attachments/20030821/5b030561/attachment.htm

From sdyer at dyermail.net  Wed Aug 20 21:40:39 2003
From: sdyer at dyermail.net (Shawn Dyer)
Date: Wed, 20 Aug 2003 14:40:39 -0500 (CDT)
Subject: [Csv] PEP 305
Message-ID: <13129.204.167.177.68.1061408439.squirrel@dyermail.net>

In studying the new CSV module, I find two problems, particularly in
interpreting csv files used for database import/export.  Currently we use
our own csv parsing/writing utility, but would like to use the language
supported facility if possible.

1. When reading a field with adjacent delimiters (an empty field), your
code always maps that to an empty string.  When interpreting DB output (at
least for DB2), an empty string is a pair of quotes.
An empty field represents NULL in the database and we parse that as the
Python object None (same result as from an SQL query).  Using the csv module
as is, an empty string and None export identically.  If this behavior were
encoded into the dialect, we could easily modify this behavior to suit our
needs.

2. The other problem for my application is the differentiation between
numeric data and strings of numbers in the csv file (this again is related
to DB2 import/export files).  Our needs are to map anything with quotes in
the csv to a string (even if it is numeric).  Anything without quotes
should map to a Python numeric type (or, as mentioned above, None when
adjacent delimiters appear).  Of course, this would imply the possibility
of a ValueError when reading a csv.  Again, it seems this behavior could be
parameterized out into the dialect.

Possibly both items could be addressed by a map_to_python_object
parameter.

If you are interested in including these modifications, I can try to come
up with a patch.

From bdelmee at advalvas.be  Thu Aug 21 20:35:17 2003
From: bdelmee at advalvas.be (Bernard Delmée)
Date: Thu, 21 Aug 2003 20:35:17 +0200
Subject: [Csv] How to use a non-default delimiter with DictReader?
Message-ID: <003d01c36812$fa7b4af0$6702a8c0@shazam.be>

Hello,

I am not sure this is the right place to post, else let me know.
I mean, is this address dedicated to the development of the
CSV module, or to its mere usage as well?

Anyway, I can't seem to be able to specify the delimiter
when building a DictReader()

I can do:

    inf = file('data.csv')
    rd = csv.reader( inf, delimiter=';' )
    for row in rd:
        # ...

But this is rejected:

    inf = file('data.csv')
    headers = inf.readline().split(';')
    rd = csv.DictReader( inf, headers, delimiter=';' )
    for row in rd:
        # ...

The DictReader constructor fails with a TypeError:
__init__() got an unexpected keyword argument 'delimiter'

Maybe I am missing something here?
One rather convoluted workaround is the following:

    inf = file('data.csv')
    d = csv.Sniffer().sniff(s)
    inf.seek(0)
    headers = inf.readline().split(';')
    rd = csv.DictReader( inf, headers, dialect=d )
    for row in rd:
        # ...

If DictReader does indeed not accept the optional "fmtparam"
then at least the documentation needs fixing ;-)
But then again I may just be misreading it....

TIA,

Bernard.

From andrewm at object-craft.com.au  Fri Aug 22 02:58:49 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 22 Aug 2003 10:58:49 +1000
Subject: [Csv] PEP 305
In-Reply-To: Message from "Shawn Dyer" <13129.204.167.177.68.1061408439.squirrel@dyermail.net>
References: <13129.204.167.177.68.1061408439.squirrel@dyermail.net>
Message-ID: <20030822005849.3CA3B3CA4A@coffee.object-craft.com.au>

>In studying the new CSV module, I find two problems, particularly in
>interpreting csv files used for database import/export. Currently we use
>our own csv parsing/writing utility, but would like to use the language
>supported facility if possible.
>
>1. When reading a field with adjacent delimiters (an empty field), your
>code always maps that to an empty string. When interpreting DB output (at
>least for DB2), an empty string is a pair of quotes. An empty field
>represents NULL in the database and we parse that as the Python object
>None (same result as from an SQL query). Using the csv module as is, an
>empty string and None export identically. If this behavior were encoded
>into the dialect, we could easily modify this behavior to suit our needs.
>
>2. The other problem for my application, is the differentiation between
>numeric data and strings of numbers in the csv file (this again is related
>to DB2 import/export files). Our needs are to map anything with quotes in
>the csv to a string (even if it is numeric). Anything without quotes
>should map to a Python numeric type (or, as mentioned above, None when
>adjacent delimiters appear).
>Of course, this would imply the possibility
>of a ValueError when reading a csv. Again, it seems this behavior could be
>parameterized out into the dialect.
>
>Possibly both items could be addressed by a map_to_python_object
>parameter.

You raise valid points, and it's something we argued over for some time
when preparing the module for Python 2.3.  I tend to agree that a switch of
some sort should enable this behaviour, but I suspect it will need to be at
least partially implemented in the underlying C parser (which makes it a
little less trivial).

As you note, there are two separate problems here - the first is that it
is impossible to distinguish between an empty field and an empty string:
this will need changes to the C parser.

The second is that of typing the results: I'm not convinced this belongs
in the csv module - the database user probably has a better idea of the
required types than the csv module could ever have.  A layer on top of the
csv parser that takes hints from the database and casts columns to the
appropriate type would be the best option - possibly a list of type
converters would be passed in.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/
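
The converter layer suggested in the last message might look roughly like the sketch below. It is written for current Python 3 rather than 2.3, and the names typed_rows, converters and int_or_none are hypothetical; nothing here is part of the csv module. It also cannot tell an empty field from a quoted empty string, since the reader returns '' for both (the first problem in the thread).

```python
import csv
import io


def typed_rows(rows, converters):
    """Yield each row with converters[i] applied to column i.

    Hypothetical helper for illustration only; not a csv module API.
    Columns beyond len(converters) are dropped by zip(), and a failed
    conversion raises whatever the converter raises (e.g. ValueError).
    """
    for row in rows:
        yield [convert(value) for convert, value in zip(converters, row)]


def int_or_none(text):
    # Assumption: an empty field stands for a database NULL.
    return int(text) if text else None


sample = io.StringIO('1,"widget",2.5\n,"gadget",4\n')
result = list(typed_rows(csv.reader(sample), [int_or_none, str, float]))
```

Keeping the converters outside the parser leaves the choice of column types with the caller, which matches the point that the database user knows the required types better than the csv module could.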