[ python-Bugs-1331062 ] utf 7 codec broken

Thu Oct 20 00:34:47 CEST 2005

Bugs item #1331062, was opened at 2005-10-19 10:23
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1331062&group_id=5470

Category: Unicode
Group: Python 2.4
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Ralf Schmitt (titty)
Assigned to: M.-A. Lemburg (lemburg)
Summary: utf 7 codec broken

Initial Comment:
The following code doesn't work as expected:

ralf@stronzo:~$ cat t.py
#! /usr/bin/env python

s = 'Auguste and Louis Lumi\xe8re'
print repr(s)
u1 = s.decode('utf7')
print 'from utf7: %d %r' % (len(u1), u1)
u2 = u'Auguste and Louis Lumi\xe8re'
print '       u2: %d %r' % (len(u2), u2)

print 'u1==u2', u1==u2

e1 = u1.encode('utf8')
e2 = u2.encode('utf8')

print 'e1=%r' % e1
print 'e2=%r' % e2

unicode(e2, 'utf8')
unicode(e1, 'utf8')
ralf@stronzo:~$ python t.py
'Auguste and Louis Lumi\xe8re'
from utf7: 25 u'Auguste and Louis Lumi\xe8re'
       u2: 25 u'Auguste and Louis Lumi\xe8re'
u1==u2 False
e1='Auguste and Louis Lumi\xff\xbf\xbf\xa8re'
e2='Auguste and Louis Lumi\xc3\xa8re'
Traceback (most recent call last):
  File "t.py", line 19, in ?
    unicode(e1, 'utf8')
  File "/usr/local/lib/python2.4/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 22: unexpected code byte

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-20 00:34

Message:

Fixed in CVS:

Checking in unicodeobject.c;
/cvsroot/python/python/dist/src/Objects/unicodeobject.c,v  <--  unicodeobject.c
new revision: 2.233; previous revision: 2.232
done

I've marked this as a backport candidate.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-19 14:09

Message:

I can confirm this: on a UCS4 build, Python accepts the
malformed UTF-7 string.

I'll have a look at Sjoerd's suggestion.

----------------------------------------------------------------------

Comment By: Ralf Schmitt (titty)
Date: 2005-10-19 13:29

Message:

The problem *disappears* on FreeBSD if I configure *without*
--enable-unicode=ucs4.
I guess the Debian build also uses UCS4, so this is not a
compiler bug: FreeBSD uses gcc 2.95 and Debian uses gcc 4.0.x.

----------------------------------------------------------------------

Comment By: Sjoerd Mullender (sjoerd)
Date: 2005-10-19 13:17

Message:

The definition of SPECIAL in unicodeobject.c is wrong.  It
tests a character for > 127, but when characters are signed
and Py_UNICODE expands to a signed type, this doesn't do
what was intended.
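
For illustration, the arithmetic behind this can be sketched in
Python 3 with ctypes. This is a model under stated assumptions, not
the code in unicodeobject.c: it assumes the input byte is read
through a signed 8-bit char, that Py_UNICODE is an unsigned 16-bit
type on a UCS2 build and a signed 32-bit type on the affected UCS4
builds, and that the UTF-8 encoder applied the ordinary four-byte
layout without a range check. The names is_high and
utf8_four_byte_layout are hypothetical helpers.

```python
import ctypes

raw = 0xE8  # the byte '\xe8' from the report

# Read the byte through a signed 8-bit type (assumed: signed C char).
as_signed_char = ctypes.c_byte(raw).value            # -24

# Assumed UCS2 case: widening to unsigned 16 bits wraps to 0xFFE8.
as_unsigned16 = ctypes.c_uint16(as_signed_char).value

# Assumed UCS4 case: widening to signed 32 bits keeps -24.
as_signed32 = ctypes.c_int32(as_signed_char).value


def is_high(c):
    """Model of the '> 127' part of the check described above."""
    return c > 127


print(is_high(as_unsigned16))  # True: the byte is rejected
print(is_high(as_signed32))    # False: the byte slips through

# The same 32 bits viewed as an unsigned code point value.
bits = as_signed32 & 0xFFFFFFFF  # 0xFFFFFFE8


def utf8_four_byte_layout(ch):
    """Apply the four-byte UTF-8 bit layout with no range check."""
    return bytes([
        (0xF0 | (ch >> 18)) & 0xFF,
        0x80 | ((ch >> 12) & 0x3F),
        0x80 | ((ch >> 6) & 0x3F),
        0x80 | (ch & 0x3F),
    ])


print(utf8_four_byte_layout(bits))  # b'\xff\xbf\xbf\xa8'
```

Under these assumptions the last line reproduces the
'\xff\xbf\xbf\xa8' sequence seen in e1 in the initial comment.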

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-19 13:07

Message:

I was testing on SuSE Linux 9.2.

Sounds like a compiler bug. Could you try compiling with
optimization switched off on FreeBSD?

Thanks.

----------------------------------------------------------------------

Comment By: Ralf Schmitt (titty)
Date: 2005-10-19 12:58

Message:

On Debian testing and FreeBSD 4.11, using Python 2.4.2,
'\xe8'.decode('utf7') succeeds...
With the Windows version I also get that error.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-19 12:30

Message:

Hmm, running Python 2.4.2 I get:

>>> s = 'Auguste and Louis Lumi\xe8re'
>>> print repr(s)
'Auguste and Louis Lumi\xe8re'
>>> u1 = s.decode('utf7')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-22: unexpected special character

This looks correct, as UTF-7 data may not contain bytes
with the high bit set.
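
As a rough check of that rule, here is a small sketch for a current
Python 3 interpreter (an assumption: the thread itself uses Python
2.4 syntax). It shows the raw byte string being rejected by the
utf-7 codec, and the same text passing through the codec as pure
7-bit data once it has been encoded properly.

```python
raw = b'Auguste and Louis Lumi\xe8re'

# A byte with the high bit set is not valid UTF-7 input.
try:
    raw.decode('utf-7')
    rejected = False
except UnicodeDecodeError:
    rejected = True
print('raw bytes rejected:', rejected)

# Encoding the intended text yields only 7-bit bytes.
text = 'Auguste and Louis Lumi\xe8re'
encoded = text.encode('utf-7')
ascii_only = all(b < 128 for b in encoded)
print('encoded form is 7-bit only:', ascii_only)

# Decoding the encoded form gives the original text back.
round_trip = encoded.decode('utf-7')
print('round trip ok:', round_trip == text)
```

In UTF-7, U+00E8 is carried as a modified base64 run ('+AOg-'), so
the encoded form never needs a byte above 127.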

----------------------------------------------------------------------