file iterators and codecs, a bug?

Greg Aumann greg2567 at yahoo.com.au
Sun Nov 3 22:15:18 EST 2002


Recently I figured out how to use iterators and
generators. Quite easy to use and a great improvement.


But when I refactored some of my code I came across a discrepancy
that seems like it must be a bug. If you use the old file-reading
idiom with a codec, the lines are converted to unicode, but if you
use the new iterator idiom they retain the original encoding and
are returned as non-unicode strings. Surely the new
"for line in file:" idiom should give the same result as the old
"while 1: ..." one.

I came across this when using the pythonzh Chinese codecs, but the
code below uses the cp1252 encoding to illustrate the problem
because everyone should have that codec. The symptoms are the same
with both codecs.

I am using Python 2.2.2 on Windows 2000.

Is this definitely a bug, or is it an undocumented
'feature' of the codecs module?

Greg Aumann

The following code illustrates the problem:
------------------------------------------------------------------------
"""Check readline iterator using a codec."""

import codecs

fname = 'tmp.txt'
f = file(fname, 'w')
for i in range(0x82, 0x8c):
    f.write( '%x, %s\n' % (i, chr(i)))
f.close()

def test_iter():
    print '\ntesting codec iterator.'
    f = codecs.open(fname, 'r', 'cp1252')
    for line in f:
        l = line.rstrip()
        print repr(l)
        print repr(l.decode('cp1252'))
    f.close()

def test_readline():
    print '\ntesting codec readline.'
    f = codecs.open(fname, 'r', 'cp1252')
    while 1:
        line = f.readline()
        if not line:
            break
        l = line.rstrip()
        print repr(l)
        try:
            print repr(l.decode('cp1252'))
        except AttributeError, msg:
            print 'AttributeError', msg
    f.close()

test_iter()
test_readline()
------------------------------------------------------------------------
This code gives the following output:
------------------------------------------------------------------------
testing codec iterator.
'82, \x82'
u'82, \u201a'
'83, \x83'
u'83, \u0192'
'84, \x84'
u'84, \u201e'
'85, \x85'
u'85, \u2026'
'86, \x86'
u'86, \u2020'
'87, \x87'
u'87, \u2021'
'88, \x88'
u'88, \u02c6'
'89, \x89'
u'89, \u2030'
'8a, \x8a'
u'8a, \u0160'
'8b, \x8b'
u'8b, \u2039'

testing codec readline.
u'82, \u201a'
AttributeError 'unicode' object has no attribute 'decode'
u'83, \u0192'
AttributeError 'unicode' object has no attribute 'decode'
u'84, \u201e'
AttributeError 'unicode' object has no attribute 'decode'
u'85, \u2026'
AttributeError 'unicode' object has no attribute 'decode'
u'86, \u2020'
AttributeError 'unicode' object has no attribute 'decode'
u'87, \u2021'
AttributeError 'unicode' object has no attribute 'decode'
u'88, \u02c6'
AttributeError 'unicode' object has no attribute 'decode'
u'89, \u2030'
AttributeError 'unicode' object has no attribute 'decode'
u'8a, \u0160'
AttributeError 'unicode' object has no attribute 'decode'
u'8b, \u2039'
AttributeError 'unicode' object has no attribute 'decode'
------------------------------------------------------------------------
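
A possible workaround, sketched on the assumption that readline() is
the path that actually decodes (as the "testing codec readline" output
above suggests): drive readline() through the two-argument form of
iter() instead of iterating over the file object directly. The helper
name read_decoded_lines and the demo file are made up for illustration,
and the sketch is written against a current Python rather than 2.2.2,
so treat it as an idea to adapt, not a tested fix for that version.

```python
"""Sketch: collect decoded lines by calling readline() explicitly."""
import codecs
import os
import tempfile


def read_decoded_lines(path, encoding):
    """Return all lines of path, decoded by the codec reader's readline().

    Hypothetical helper: it avoids "for line in f" and relies only on
    readline(), which returns an empty string at end of file.
    """
    f = codecs.open(path, 'r', encoding)
    try:
        # iter(callable, sentinel) keeps calling f.readline() until it
        # returns the sentinel (the empty string, meaning end of file).
        return list(iter(f.readline, u''))
    finally:
        f.close()


# Demo data: two lines containing the cp1252 bytes 0x82 and 0x83,
# written in binary mode so the bytes reach the file unchanged.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
out = open(demo_path, 'wb')
out.write(b'82, \x82\n83, \x83\n')
out.close()

demo_lines = read_decoded_lines(demo_path, 'cp1252')
os.remove(demo_path)

for line in demo_lines:
    # Each line is a decoded (unicode) string, not raw cp1252 bytes.
    print(repr(line.rstrip()))
```

The design choice here is simply to stay on the readline() code path,
so the loop body sees the same decoded strings that test_readline()
does, while still being written as a "for" loop.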

