This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: codecs.readline sometimes removes newline chars
Type: Stage:
Components: Library (Lib) Versions: Python 2.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: doerwalter, glchapman, irmen, mmm, vrmeyer
Priority: normal Keywords:

Created on 2005-04-02 15:14 by irmen, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
bug.py irmen, 2005-04-02 15:14 test program that shows the problem
testfile1.txt irmen, 2005-04-02 15:15 inputfile with CR/LF line endings (needed by bug.py)
Messages (14)
msg24858 - (view) Author: Irmen de Jong (irmen) (Python triager) Date: 2005-04-02 15:14
In Python 2.4.1 I observed a new bug in codecs.readline:
it seems that with certain inputs it removes newline
characters from the end of the line.

Probably related to bug #1076985 (Incorrect
behaviour of StreamReader.readline leads to crash)
and bug #1098990 (codec readline() splits lines apart),
both with status closed, so I'm assigning this to Walter.

See the attached files that demonstrate the problem.
Reproduced with Python 2.4.1 on Windows XP and on
Linux. The problem does not occur with Python 2.4.

(Btw, it seems bug #1076985 was fixed in Python 2.4.1,
but the other one (#1098990) was not?)
msg24859 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-04-04 22:01
Logged In: YES 
user_id=89016

Checked in a fix as:
Lib/codecs.py 1.42/1.43
Lib/test/test_codecs.py 1.22
Lib/codecs.py 1.35.2.6
Lib/test/test_codecs.py 1.15.2.4

Are you really sure that the fix for #1098990 is not in
2.4.1? According to the tracker for #1098990 the fix was in
Lib/codecs.py revision 1.35.2.2, and revision 1.35.2.3 is the
one that got the r241c1 tag, so the fix should be in 2.4.1.
msg24860 - (view) Author: Greg Chapman (glchapman) Date: 2005-04-09 22:47
Logged In: YES 
user_id=86307

Sorry to comment on a closed report, but perhaps this fix
should not be limited to cases where size is None.
Today I ran into a spurious syntax error when trying to
import pythondoc (from
http://effbot.org/downloads/pythondoc-2.1b3-20050325.zip).
It turned out a \r was ending up in what looked to the
parser like the middle of a line, presumably because a \n
was dropped.  Anyway, I applied the referenced patch to
2.4.1, except I left out the "size is None" condition, since
I knew the tokenizer passes in a size param.  With that
change pythondoc imported successfully.  (Also, I just ran the
test suite and nothing broke.)

Since the size parameter is already documented as being
passed to StreamReader.read (in codecs.py -- the HTML
documentation needs to be updated), and since
StreamReader.read says size is an approximate maximum,
perhaps it's OK to read one extra byte.
msg24861 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-04-11 21:42
Logged In: YES 
user_id=89016

OK, I'm reopening the bug report. I didn't manage to install
pythondoc; cElementTree complains about: No such file or
directory: './pyconfig.h'. Can you provide a simple Python
file that fails when imported?
msg24862 - (view) Author: Michal Rydlo (mmm) Date: 2005-04-14 22:04
Logged In: YES 
user_id=65460

foo2.py from #1163244 fails to import. Not being an expert in
Python internals, I hope it is due to this bug.
msg24863 - (view) Author: Greg Chapman (glchapman) Date: 2005-04-14 23:25
Logged In: YES 
user_id=86307

I think the foo2.py from #1163244 is probably the same bug;
at any rate, the reason for it is that a \r is at the
beginning of the last line when read in by decoding_fgets.
I have a simpler test file which shows the bug, which I'll
email to Walter (you basically just have to get a \r as the
last character in the block read by StreamReader, so that
atcr will be true).

The problem is caused by StreamReader.readline doing:

    if self.atcr and data.startswith(u"\n"):
        data = data[1:]

since the tokenizer relies on '\n' as the line break
character but will never see the '\n' removed by the above.
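The effect described above can be illustrated with a minimal sketch (read_chunks and join_with_atcr are hypothetical simplifications invented for illustration, not the actual Lib/codecs.py code):

```python
def read_chunks(text, size):
    """Yield fixed-size chunks of already-decoded text."""
    for i in range(0, len(text), size):
        yield text[i:i + size]

def join_with_atcr(chunks):
    r"""Hypothetical simplification of the 2.4.1 atcr handling:
    a '\n' at the start of a chunk is dropped when the previous
    chunk ended in '\r' -- and the caller never sees it."""
    atcr = False
    out = []
    for data in chunks:
        if atcr and data.startswith("\n"):
            data = data[1:]          # the '\n' silently disappears here
        atcr = data.endswith("\r")
        out.append(data)
    return "".join(out)

# With 9-character chunks the boundary falls between '\r' and '\n',
# so a caller that only recognizes '\n' sees a single line.
text = "line one\r\nline two"
result = join_with_atcr(read_chunks(text, 9))
```

The result contains no '\n' at all, which is exactly the situation the tokenizer cannot cope with.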

FWIW (not much), I think the 2.4 StreamReader.readline
actually made more sense than the current code, although a
few changes would seem useful (see below).  I don't think it
is particularly useful to treat the size parameter as a
fixed maximum number of bytes to read, since the number of
bytes read has no fixed relationship to the number of
decoded unicode characters (and also, in the case of the
tokenizer, no fixed relationship to the number of bytes of
encoded utf8).  Also, given the current code, the size
parameter is effectively ignored if there is a charbuffer:
if you have 5 characters sitting in the charbuffer and use a
size of 0x1FF, you only get back the 5 characters, even if
they do not end in a linebreak.  For the tokenizer, this
means an unnecessary PyMem_RESIZE and an extra call to
decoding_readline roughly every BUFSIZ bytes in the file
(since the tokenizer assumes failure to fetch a complete
line means its buffer is too small, whereas in fact it was
caused by an incomplete line being stored in the
StreamReader's charbuffer).

As to changes from 2.4, if the unicode object were to add a
findlinebreak method which returns the index of the first
character for which Py_UNICODE_ISLINEBREAK is true, readline
could use that instead of find("\n").  If it used such a
method, readline would also need to explicitly handle a
"\r\n" sequence, including a potential read(1) if a '\r'
appears at the end of the data (in the case where size is
not None).  Of course, one problem with that idea is it
requires a new method, which may not be allowed until 2.5,
and the 2.4.1 behavior definitely needs to be fixed some
way. (Interestingly, it looks to me like sre has everything
necessary for searching for unicode linebreaks except syntax
with which to express the idea in a pattern (maybe I'm
missing something, but I can't find a way to get a compiled
pattern to emit CATEGORY_UNI_LINEBREAK).)
msg24864 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-04-15 18:16
Logged In: YES 
user_id=89016

The current readline() is implemented so that even if the
complete line ending (i.e. '\r\n') can't be read within size
bytes, at least something that might work as a line end
(i.e. '\r') is available at the end of the string, so that
the user knows that the line is complete. The atcr member
makes sure that the '\n' that is read next isn't
misinterpreted as another line. Unfortunately the import
mechanism doesn't work that way: it demands a '\n' as line
terminator and will continue reading if it only finds the
'\r'. This means that the '\n' will be skipped and the
import mechanism treats those two lines as one.

IMHO the best solution would be to read the extra character
even if size is None, as glchapman suggested, so the above
situation would really only happen if the last character
from the stream is '\r'. I think the tokenizer should be
able to survive this. What it didn't survive in 2.4 was that
readline() tried to get its hands on a complete line even if
the length of the line was way bigger than the size passed
in. If we remove the "size is None" restriction from the
current code, then I think we should remove the atcr logic
too, because it could only happen under exceptional
circumstances and the old handling might be better for those
users that don't recognize '\r' as a line end.

But in any case the tokenizer should be fixed to be able to
handle any line length returned from readline(). I'd really
like to get a review by Martin v. Löwis of glchapman's patch
#1101726.

Of course the simplest solution would be: "If you want a
complete line, don't pass a size parameter to readline()". I
don't know if it would make sense to change the PEP263
machinery to support this.

The other problem is if readline() returns data from the
charbuffer. I don't really see how this can be fixed. We
might call read() with a size parameter even if there are
characters in the charbuffer, but what happens if the user
calls readline(size=100000) first and then
readline(size=10)? The first readline() call might put a
very long string into the charbuffer, and then the second
call will return a unicode string whose length is in no way
correlated to the size parameter. (This happens even
with the current code, of course, so the message should be:
don't trust the size parameter; it only tells the underlying
stream how many bytes to read (approximately), it doesn't
tell you *anything* about the amount of data returned.)
This was different in 2.3.x, because the readline() method
of the underlying stream was used, which handled this
properly, but only worked if an encoded linefeed was
recognizable as a normal 8bit '\n' linefeed by the
underlying readline() (i.e. the UTF-16 encodings didn't work).
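The "don't trust the size parameter" point can be demonstrated with a small hypothetical model (SketchReader is invented here for illustration, not the actual StreamReader):

```python
class SketchReader:
    r"""Hypothetical model of the buffering described above: a read
    fills a charbuffer, and readline() serves lines from it,
    ignoring `size` once characters are buffered."""
    def __init__(self, stream):
        self.stream = stream      # already-decoded text
        self.pos = 0
        self.charbuffer = ""

    def readline(self, size=None):
        if not self.charbuffer:
            end = len(self.stream) if size is None else self.pos + size
            self.charbuffer = self.stream[self.pos:end]
            self.pos += len(self.charbuffer)
        cut = self.charbuffer.find("\n") + 1
        if cut == 0:                     # no line break buffered
            cut = len(self.charbuffer)
        line = self.charbuffer[:cut]
        self.charbuffer = self.charbuffer[cut:]
        return line

r = SketchReader("short\n" + "x" * 50 + "\nlast\n")
first = r.readline(size=100000)   # buffers far more than one line
second = r.readline(size=10)      # 51 characters despite size=10
```

The second call returns a 51-character line even though size=10 was passed, because the first call left everything after "short\n" sitting in the charbuffer.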
msg24865 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-04-21 21:55
Logged In: YES 
user_id=89016

OK, I've checked in the following:

Lib/codecs.py 1.44
Lib/test/test_codecs.py 1.23
Lib/codecs.py 1.35.2.7
Lib/test/test_codecs.py 1.15.2.5

with the following changes as suggested by glchapman:
If a chunk read has a trailing "\r", read one more character
even if the user has passed a size parameter. Remove the
atcr logic. This should fix most of the problems. There are
three open issues:

1) How do we handle the problem of a truncated line, if the
data comes from the charbuffer instead of being read from
the stream?

2) How do we handle error reporting? The buffering code
might read more data than the current line. If this data has
a decoding error the caller might report a wrong line
number. (See bug #1178484)

3) Fixing the tokenizer.
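The trailing-'\r' handling described above can be sketched as follows (read_line_chunk is a hypothetical simplification invented here, not the checked-in Lib/codecs.py code):

```python
import io

def read_line_chunk(stream, size):
    r"""Read roughly `size` characters, then one more if the chunk
    ends in '\r', so a '\r\n' pair is never split across reads."""
    data = stream.read(size)
    if data.endswith("\r"):
        data += stream.read(1)  # pull in the '\n' that may follow
    return data

# The 9-character read would otherwise stop right after the '\r'.
chunk = read_line_chunk(io.StringIO("line one\r\nline two"), 9)
```

With the extra read the caller always sees the complete '\r\n' ending; the only remaining ambiguity is a '\r' that is literally the last character of the stream.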
msg24866 - (view) Author: Greg Chapman (glchapman) Date: 2005-04-23 16:46
Logged In: YES 
user_id=86307

> 1) How do we handle the problem of a truncated line, if the
> data comes from the charbuffer instead of being read from
> the stream?

My suggestion is to make the top of the loop look like:

    while True:
        havecharbuffer = bool(self.charbuffer)

And then the break condition (when no line break found)
should be:

    # we didn't get anything or this was our only try
    if not data or (size is not None and not havecharbuffer):

(too many negatives in that).  Anyway, the idea is that, if
size specified, there will be at most one read of the
underlying stream (using the size).  So if you enter the
loop with a charbuffer, and that charbuffer does not have a
line break, then you redo the loop once (the second time it
will break, because havecharbuffer will be False).
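Putting the two fragments together, the proposed control flow might look like this (FakeReader and this standalone readline are a hypothetical sketch of the suggestion, not the real StreamReader implementation):

```python
class FakeReader:
    """Minimal stand-in for a StreamReader: a pre-filled charbuffer
    plus an underlying stream of already-decoded text."""
    def __init__(self, charbuffer, stream):
        self.charbuffer = charbuffer
        self.stream = stream
        self.pos = 0

    def read(self, size=None):
        # Buffered characters are returned first; otherwise read
        # (approximately) `size` characters from the stream.
        if self.charbuffer:
            data, self.charbuffer = self.charbuffer, ""
            return data
        end = len(self.stream) if size is None else self.pos + size
        data = self.stream[self.pos:end]
        self.pos += len(data)
        return data

def readline(reader, size=None):
    """Sketch of the proposed control flow: when size is given,
    the underlying stream is read at most once per call."""
    line = ""
    while True:
        havecharbuffer = bool(reader.charbuffer)
        data = reader.read(size)
        line += data
        end = line.find("\n")
        if end >= 0:
            # anything after the break goes back into the buffer
            reader.charbuffer = line[end + 1:] + reader.charbuffer
            return line[:end + 1]
        # we didn't get anything or this was our only try
        if not data or (size is not None and not havecharbuffer):
            return line

# Entering with an incomplete line buffered, one stream read still
# completes the line; with no buffer, size caps the single read.
full = readline(FakeReader("par", "tial\nrest"), size=6)
capped = readline(FakeReader("", "abcdefghij"), size=4)
```

Entering the loop with "par" buffered still yields the complete "partial\n", while a call with an empty buffer returns at most one sized read ("abcd"), truncated or not.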

Also, not sure about this, but should the size parameter
default to -1 (to keep it in sync with read)?

As to issue 2, it looks like it should be possible to get
the line number right, because the UnicodeDecodeError
exception object has all the necessary information in it
(including the original string).  I think this should be
done by fp_readl (in tokenizer.c).  

By the way, using a findlinebreak function (using sre) turns
out to be slower than splitting/joining when no size is
specified (which I assume will be the overwhelmingly most
common case), so that turned out to be a bad idea on my part.
msg24867 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-05-16 14:20
Logged In: YES 
user_id=89016

> > 1) How do we handle the problem of a truncated line, if the
> > data comes from the charbuffer instead of being read from
> > the stream?
> > 
> > My suggestion is to make the top of the loop look like:
> > 
> >     while True:
> >         havecharbuffer = bool(self.charbuffer)
> > 
> > And then the break condition (when no line break found)
> > should be:
> > 
> >     # we didn't get anything or this was our only try
> >     if not data or (size is not None and not havecharbuffer):
> > 
> > (too many negatives in that).  Anyway, the idea is that, if
> > size specified, there will be at most one read of the
> > underlying stream (using the size).  So if you enter the
> > loop with a charbuffer, and that charbuffer does not have a
> > line break, then you redo the loop once (the second time it
> > will break, because havecharbuffer will be False).

This makes sense. However, with the current state of the
tokenizer this might be too dangerous, because the resulting
line might be twice as long. So fixing the tokenizer should
be the first step. BTW, your patch fixes the problems with
the fix for #1178484, so I think I'm going to apply the
patch in the next few days, if there are no objections.

> > Also, not sure about this, but should the size parameter
> > default to -1 (to keep it in sync with read)?

None seems to be a better default from an API viewpoint,
but -1 is better for "C compatibility".

> > As to issue 2, it looks like it should be possible to get
> > the line number right, because the UnicodeDecodeError
> > exception object has all the necessary information in it
> > (including the original string).  I think this should be
> > done by fp_readl (in tokenizer.c).  

The patch for #1178484 fixes this. Combined with this patch
I think we're in good shape.

> > By the way, using a findlinebreak function (using sre) turns
> > out to be slower than splitting/joining when no size is
> > specified (which I assume will be the overwhelmingly most
> > common case), so that turned out to be a bad idea on my part.

Coding this on the C level and using Py_UNICODE_ISLINEBREAK()
should be the fastest version, but I don't know if this is worth
it.
msg24868 - (view) Author: Alan Meyer (vrmeyer) Date: 2005-07-12 18:07
Logged In: YES 
user_id=338015

Hello doerwalter,

Our thanks to you and to glchapman for working on this bug.

I think the project I am working on may have run into this
bug while attempting to use the Python COM client wrappers
for ADO on Win32.  We saw the problem in Python 2.4.1, but
not 2.4 or 2.3.5.

Do you have any projection for when the fix will make it into
the stable distribution?

Our production is running on 2.3.5.  If it looks like a long
while before the fix is distributed, we may upgrade to 2.4.0.
Otherwise we'll stick with 2.3.5 and wait.

Thanks again.

    Alan
msg24869 - (view) Author: Greg Chapman (glchapman) Date: 2005-07-12 19:53
Logged In: YES 
user_id=86307

Build 204 of pywin32 has a workaround for bug 1163244 which
should also avoid this bug: it leaves out the encoding
notation in generated COM wrapper files (so loading will not
involve the codecs module).  If your wrapper files don't
need extended characters (I don't think ADO does), you might
want to give that a shot.
msg24870 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-10-09 19:45
Logged In: YES 
user_id=89016

2.4.2 is out the door, so I'm closing this bug as fixed.
msg24871 - (view) Author: Irmen de Jong (irmen) (Python triager) Date: 2005-10-09 23:25
Logged In: YES 
user_id=129426

Confirmed fixed with Python 2.4.2 on Mac OS X. But the
status is still 'open' although resolution is 'fixed'...
History
Date User Action Args
2022-04-11 14:56:10adminsetgithub: 41794
2005-04-02 15:14:44irmencreate