[issue6097] Encoded surrogate characters on command line not escaped in sys.argv

David Watson report at bugs.python.org
Sun May 24 20:03:32 CEST 2009


New submission from David Watson <baikie at users.sourceforge.net>:

The mbstowcs and mbrtowc functions, which are used for the initial
conversion of command-line arguments on Unix, can return lone or
paired surrogates (e.g. \udcff for \xed\xb3\xbf in non-strict
UTF-8), and these surrogates are currently placed into sys.argv
unescaped.  This creates various problems, such as strings that
cannot be re-encoded into bytes and strings that could represent
more than one byte sequence.  The examples below use the following
script in a UTF-8 locale on Linux:

import sys
print(repr(sys.argv[1]))
print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
"surrogateescape")))


Strings that cannot be re-encoded:

$ ./python argtest.py $'\xed\xa0\x80'
'\ud800'
Traceback (most recent call last):
  File "argtest.py", line 6, in <module>
    print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
"surrogateescape")))
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed

$ ./python argtest.py $'\xed\xb0\x80'
'\udc00'
Traceback (most recent call last):
  File "argtest.py", line 6, in <module>
    print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
"surrogateescape")))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in
position 0: surrogates not allowed
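
As a rough illustration (not part of the test script above), the
same failures can be reproduced without going through the command
line.  The helper name can_round_trip is hypothetical, and UTF-8 is
assumed as the file system encoding.  The surrogateescape handler
only encodes U+DC80..U+DCFF, so any other surrogate in the string
raises UnicodeEncodeError:

```python
def can_round_trip(arg: str, encoding: str = "utf-8") -> bool:
    """Return True if arg can be turned back into bytes.

    Hypothetical helper; assumes UTF-8 stands in for the
    file system encoding.
    """
    try:
        arg.encode(encoding, "surrogateescape")
    except UnicodeEncodeError:
        return False
    return True


# '\ud800' and '\udc00' are the values shown in the tracebacks above;
# '\udcff' is an escaped undecodable byte and encodes fine.
for value in ("\ud800", "\udc00", "\udcff"):
    print(ascii(value), can_round_trip(value))
```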


Aliasing between non-decodable bytes and encoded lone surrogates:

$ ./python argtest.py $'\xff'
'\udcff'
b'\xff'

$ ./python argtest.py $'\xed\xb3\xbf'
'\udcff'
b'\xff'
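
A minimal sketch of why these two arguments collide, assuming UTF-8
as the encoding.  The "surrogatepass" handler is used here only to
mimic a non-strict UTF-8 conversion such as the one described
above; it is not what the interpreter itself calls:

```python
# Assumption: UTF-8 stands in for the locale encoding.
undecodable = b"\xff"

# A non-strict encoder would produce these bytes for U+DCFF.
lone_surrogate_bytes = "\udcff".encode("utf-8", "surrogatepass")

# The undecodable byte is escaped to U+DCFF ...
escaped = undecodable.decode("utf-8", "surrogateescape")
# ... and a lenient decoder yields the same code point directly.
lenient = lone_surrogate_bytes.decode("utf-8", "surrogatepass")

print(lone_surrogate_bytes)
print(ascii(escaped), ascii(lenient), escaped == lenient)

# Both strings re-encode to the single byte, so the three-byte
# original can no longer be recovered.
print(lenient.encode("utf-8", "surrogateescape"))
```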


Aliasing between encoding of a non-BMP character and encoding of
its UTF-16 representation (on narrow Unicode builds):

$ ./python argtest.py $'\xf0\x90\x80\x80'
'\U00010000'
b'\xf0\x90\x80\x80'

$ ./python argtest.py $'\xed\xa0\x80\xed\xb0\x80'
'\U00010000'
b'\xf0\x90\x80\x80'
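
The relationship between the two byte sequences can be sketched as
follows.  This only shows where the bytes come from; it does not
reproduce a narrow build, and the variable names are illustrative:

```python
ch = "\U00010000"

# UTF-16 represents U+10000 as a high/low surrogate pair.
utf16 = ch.encode("utf-16-be")
high = int.from_bytes(utf16[0:2], "big")
low = int.from_bytes(utf16[2:4], "big")
print(hex(high), hex(low))

# Encoding each surrogate separately (as non-strict UTF-8 would)
# gives the six-byte form from the second example.
via_surrogates = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")

# Encoding the character itself gives the four-byte form.
direct = ch.encode("utf-8")

print(via_surrogates)
print(direct)
```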


Attached is a patch to fix these problems by replacing any
decoded characters in the range 0xd800...0xdfff with the
surrogateescape encodings of their source bytes.
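
For illustration, the intended outcome can be seen with Python's
own strict UTF-8 decoder, which rejects encoded surrogates and, with
surrogateescape, escapes each offending source byte individually.
This is only a sketch of the desired result, not the attached
patch; decode_arg is a hypothetical name and UTF-8 is assumed:

```python
def decode_arg(raw: bytes) -> str:
    """Decode an argument so that it always round-trips.

    Hypothetical helper; assumes a UTF-8 locale.
    """
    return raw.decode("utf-8", "surrogateescape")


samples = (b"\xff", b"\xed\xb3\xbf", b"\xed\xa0\x80", b"\xf0\x90\x80\x80")
for raw in samples:
    text = decode_arg(raw)
    # Every input maps to a distinct string and encodes back
    # to exactly the bytes it came from.
    assert text.encode("utf-8", "surrogateescape") == raw
    print(raw, "->", ascii(text))
```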

----------
files: escape-surrogates.diff
keywords: patch
messages: 88272
nosy: baikie
severity: normal
status: open
title: Encoded surrogate characters on command line not escaped in sys.argv
type: behavior
versions: Python 3.1, Python 3.2
Added file: http://bugs.python.org/file14054/escape-surrogates.diff

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue6097>
_______________________________________

