[New-bugs-announce] [issue24848] Warts in UTF-7 error handling

Wed Aug 12 09:36:05 CEST 2015

New submission from Serhiy Storchaka:

Trying to implement UTF-7 codec in Python I found some warts in error handling.

1. Non-ASCII bytes.

No errors:
>>> 'a€b'.encode('utf-7')
b'a+IKw-b'
>>> b'a+IKw-b'.decode('utf-7')
'a€b'

Terminating '-' at the end of the string is optional.
>>> b'a+IKw'.decode('utf-7')
'a€'

And sometimes it is optional in the middle of the string (if following char is not used in BASE64).
>>> b'a+IKw;b'.decode('utf-7')
'a€;b'

But if following char is not ASCII, it is accepted as well, and this looks as a bug.
>>> b'a+IKw\xffb'.decode('utf-7')
'a€ÿb'

In all other cases non-ASCII byte causes an error:
>>> b'a\xffb'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode byte 0xff in position 1: unexpected special character
>>> b'a\xffb'.decode('utf-7', 'replace')
'a�b'

2. Ending lone high surrogate.

Lone surrogates are silently accepted by utf-7 codec.

>>> '\ud8e4\U0001d121'.encode('utf-7')
b'+2OTYNN0h-'
>>> '\U0001d121\ud8e4'.encode('utf-7')
b'+2DTdIdjk-'
>>> b'+2OTYNN0h-'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2OTYNN0h'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2DTdIdjk-'.decode('utf-7')
'𝄡\ud8e4'

Except at the end of unterminated shift sequence:
>>> b'+2DTdIdjk'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-8: unterminated shift sequence

3. Incorrect shift sequence.

Strange behavior happens when shift sequence ends with wrong bits.
>>> b'a+IKx-b'.decode('utf-7', 'ignore')
'a€b'
>>> b'a+IKx-b'.decode('utf-7', 'replace')
'a€�b'
>>> b'a+IKx-b'.decode('utf-7', 'backslashreplace')
'a€\\x2b\\x49\\x4b\\x78\\x2db'

The decoder first decodes as much characters as can, and then pass all shift sequence (including already decoded bytes) to error handler. Not sure this is a bug, but this differs from common behavior of other decoders.

----------
components: Unicode
messages: 248450
nosy: ezio.melotti, haypo, lemburg, loewis, serhiy.storchaka
priority: normal
severity: normal
status: open
title: Warts in UTF-7 error handling
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue24848>
_______________________________________