[ python-Bugs-1156259 ] [2.4 regression] seeking in codecs.reader broken
SourceForge.net
noreply at sourceforge.net
Tue Mar 8 15:00:09 CET 2005
Bugs item #1156259, was opened at 2005-03-03 23:29
Message generated for change (Comment added) made by lemburg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1156259&group_id=5470
Category: Extension Modules
Group: Python 2.4
Status: Open
Resolution: None
Priority: 7
Submitted By: Matthias Klose (doko)
Assigned to: Martin v. Löwis (loewis)
Summary: [2.4 regression] seeking in codecs.reader broken
Initial Comment:
[forwarded from https://bugzilla.ubuntu.com/show_bug.cgi?id=6972 ]
This is a regression; the following script (call as
"scriptname some_textfile") fails.
It is obvious that the file starts with a number of random bytes
from the previous run.
Uncommenting the two #XXX lines makes the bug go away. So does
running it with Python 2.3.5.

import sys
import codecs
from random import random

data = codecs.getreader("utf-8")(open(sys.argv[1]))
df = data.read()
for t in range(30):
    #XXX data.seek(0,1)
    #XXX data.read()
    data.seek(0,0)
    dn=""
    for l in data:
        dn += l
        if random() < 0.1: break
    if not df.startswith(dn):
        print "OUCH",t
        print "BAD:", dn[0:100]
        print "GOOD:", df[0:100]
        sys.exit(1)
print "OK",len(df)
sys.exit(0)

----------------------------------------------------------------------
>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-03-08 15:00
Message:
Logged In: YES
user_id=38388
Walter: the patch looks good. Please also add a doc-string
mentioning the resetting of the codec in case .seek() is used.
Whether .seek() causes a mess or not is not within the
responsibility of the codec - it's an application space
decision to make. Otherwise we would have to introduce the
notion of seeking code points (rather than bytes), which I'd
rather not do since this can break existing applications in
many ways.
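
To illustrate the bytes-versus-code-points distinction made above, here is a minimal sketch in Python 3 syntax (not the 2.4 code under discussion). The sample text and variable names are made up for the example. It shows that the offset handed to .seek() counts bytes in the underlying stream, so an application that wants to land on a given character has to work out the byte offset itself.

```python
import codecs
import io

# Hypothetical sample text; "ï" and "é" each occupy two bytes in UTF-8.
text = "naïve café\n"
raw = io.BytesIO(text.encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

# The character index and the byte offset of "v" differ.
char_index = text.index("v")
byte_offset = len(text[:char_index].encode("utf-8"))

# seek() takes the byte offset, not the character index.
reader.seek(byte_offset, 0)
tail = reader.read()
```

Seeking to the character index (3) instead would land in the middle of the two-byte "ï", which is exactly the kind of decision the comment leaves to the application.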
----------------------------------------------------------------------
Comment By: Matthias Urlichs (smurf)
Date: 2005-03-08 14:20
Message:
Logged In: YES
user_id=10327
Ahem -- seek(0,*whatever*) should still be allowed, whatever
else you do, please.
Reading UTF-16 from an odd position in a file isn't always
an error -- sometimes text is embedded in weird on-disk data
structures. As long as tell() returns something you can
seek() back to, nobody's got a right to complain -- file
position arithmetic in general is nonportable.
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2005-03-04 12:44
Message:
Logged In: YES
user_id=89016
How about the following patch? Unfortunately this breaks the
codec in more obscure cases. Calling seek(0, 1) should have
no effect, but with this patch it does. Maybe calling
seek() should be prohibited? Calling seek(1, 1) in a
UTF-16 stream completely messes up the decoded text.
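
The UTF-16 remark can be reproduced with a small sketch, written here in Python 3 syntax with an in-memory stream standing in for a file (both are assumptions for illustration, not part of the patch). After seek(1, 1) the reader pairs up bytes from neighbouring code units, so every decoded character is wrong even though no error is raised.

```python
import codecs
import io

# "abc" as UTF-16-LE is b'a\x00b\x00c\x00': two bytes per character.
raw = io.BytesIO("abc".encode("utf-16-le"))
reader = codecs.getreader("utf-16-le")(raw)

# Move one byte forward from the current position (offset 0).
reader.seek(1, 1)

# The decoder now sees b'\x00b', b'\x00c' and a dangling b'\x00',
# i.e. the code units 0x6200 and 0x6300 instead of "b" and "c".
garbled = reader.read()
```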
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2005-03-04 10:56
Message:
Logged In: YES
user_id=38388
This is obviously related to the buffer logic that Walter added
to support .readline().
In order to fix the problem, a .seek() method must be implemented
that resets the buffers whenever called (before asking the stream
to seek to the specified stream position).
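
The fix described above could look roughly like the following. This is a sketch in Python 3 syntax, not Walter's actual patch: the class name SeekResettingReader is hypothetical, and it relies only on the reset() method and the stream attribute that codecs.StreamReader provides.

```python
import codecs
import io

# Reader class for UTF-8, as returned by the codec registry.
Utf8Reader = codecs.getreader("utf-8")


class SeekResettingReader(Utf8Reader):
    """Hypothetical reader whose seek() discards buffered data."""

    def seek(self, offset, whence=0):
        # Drop the byte, character and line buffers first, so that
        # read-ahead data from the old position cannot leak into the
        # next read.
        self.reset()
        # Then let the underlying stream move to the new position.
        self.stream.seek(offset, whence)


raw = io.BytesIO("first line\nsecond line\n".encode("utf-8"))
reader = SeekResettingReader(raw)

first = reader.readline()   # readline() may buffer data past this line
reader.seek(0, 0)           # buffers are cleared, stream rewinds
again = reader.readline()   # starts cleanly from the beginning
```

With the buffers cleared on every seek, the second readline() returns the same first line as the initial call, which is the behaviour the reproduction script checks for.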
----------------------------------------------------------------------