[ python-Bugs-1156259 ] [2.4 regression] seeking in codecs.reader broken

Tue Mar 8 15:00:09 CET 2005

Bugs item #1156259, was opened at 2005-03-03 23:29
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1156259&group_id=5470

Category: Extension Modules
Group: Python 2.4
Status: Open
Resolution: None
Priority: 7
Submitted By: Matthias Klose (doko)
Assigned to: Martin v. Löwis (loewis)
Summary: [2.4 regression] seeking in codecs.reader broken

Initial Comment:
[forwarded from https://bugzilla.ubuntu.com/show_bug.cgi?id=6972 ]

This is a regression; the following script (call as
"scriptname some_textfile") fails.
It is obvious that the file starts with a number of random bytes
from the previous run.

Uncommenting the two #XXX lines makes the bug go away.
So does running it with Python 2.3.5.

import sys
import codecs
from random import random

data = codecs.getreader("utf-8")(open(sys.argv[1]))
df = data.read()
for t in range(30):
    #XXX data.seek(0,1)
    #XXX data.read()
    data.seek(0,0)
    dn=""
    for l in data:
        dn += l
        if random() < 0.1: break
    if not df.startswith(dn):
        print "OUCH",t
        print "BAD:", dn[0:100]
        print "GOOD:", df[0:100]
        sys.exit(1)

print "OK",len(df)
sys.exit(0)

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-03-08 15:00

Message:
Logged In: YES 
user_id=38388

Walter: the patch looks good. Please also add a doc-string
mentioning the resetting of the codec in case .seek() is used.

Whether .seek() causes a mess or not is not within the
responsibility of the codec - it's an application space
decision to make. Otherwise we would have to introduce the
notion of seeking code points (rather than bytes), which I'd
rather not do since this can break existing
applications in many ways.

----------------------------------------------------------------------

Comment By: Matthias Urlichs (smurf)
Date: 2005-03-08 14:20

Message:
Logged In: YES 
user_id=10327

Ahem -- seek(0,*whatever*) should still be allowed, whatever
else you do, please.

Reading UTF-16 from an odd position in a file isn't always
an error -- sometimes text is embedded in weird on-disk data
structures. As long as tell() returns something you can
seek() back to, nobody's got a right to complain -- file
position arithmetic in general is nonportable.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-03-04 12:44

Message:
Logged In: YES 
user_id=89016

How about the following patch? Unfortunately this breaks the
codec in more obscure cases. Calling seek(0, 1) should have
no effect, but with this patch it does. Maybe calling
seek() should be prohibited? Calling seek(1, 1) in a
UTF-16 stream completely messes up the decoded text.
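
To illustrate the UTF-16 remark, here is a minimal sketch (written for
Python 3, not taken from the patch) that decodes the same bytes once from
an aligned position and once after a relative seek of one byte. The sample
text and the variable names are assumptions made for the example.

```python
import io

# "abc" in UTF-16-LE is b'a\x00b\x00c\x00': two bytes per character.
raw = "abc".encode("utf-16-le")

# Aligned read: decoding from byte 0 gives the original text back.
aligned = raw.decode("utf-16-le")

# Misaligned read: seek(1, 1) moves one byte forward from the current
# position, so every following pair of bytes straddles two characters.
stream = io.BytesIO(raw)
stream.seek(1, 1)
shifted = stream.read(4).decode("utf-16-le")

print(repr(aligned))
print(repr(shifted))
```

The shifted read still decodes without an error, but each code unit is
built from the high byte of one character and the low byte of the next,
so the result is unrelated text.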

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-03-04 10:56

Message:
Logged In: YES 
user_id=38388

This is obviously related to the buffer logic that Walter added
to support .readline().

In order to fix the problem, a .seek() method must be implemented
that resets the buffers whenever called (before asking the stream
to seek to the specified stream position).
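
A minimal sketch of that idea, written for Python 3 and not taken from the
patch under discussion: the class name SeekResettingReader is hypothetical,
and the sketch relies only on the documented StreamReader.reset() method
and the reader's stream attribute.

```python
import codecs
import io

# Hypothetical subclass for illustration; not the patch from this report.
class SeekResettingReader(codecs.getreader("utf-8")):
    def seek(self, offset, whence=0):
        # Drop any buffered bytes, characters and lines first, so data
        # read ahead before the seek cannot leak into later reads.
        self.reset()
        # Then ask the underlying byte stream to move.
        self.stream.seek(offset, whence)

stream = io.BytesIO("line one\nline two\nline three\n".encode("utf-8"))
reader = SeekResettingReader(stream)

first = reader.readline()   # reads ahead and buffers the remaining lines
reader.seek(0, 0)           # buffers are discarded, stream rewinds
again = reader.readline()   # decoded afresh from the start of the stream
```

Because the buffers are discarded on every call, seek(0, 1) is not a no-op
with this approach: whatever was read ahead is thrown away while the
underlying stream stays at its read-ahead position.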

----------------------------------------------------------------------
