[issue34403] test_utf8_mode.test_cmd_line() fails on HP-UX due to false assumptions

Mon Aug 27 16:58:46 EDT 2018

Michael Felt <aixtools at felt.demon.nl> added the comment:

On 27/08/2018 15:22, Michael Osipov wrote:
> Michael Osipov <1983-01-06 at gmx.net> added the comment:
>
> So I changed the test code to:
>
> diff --git a/Lib/test/test_utf8_mode.py b/Lib/test/test_utf8_mode.py
> index 26e2e13ec5..d9f8a3ed8b 100644
> --- a/Lib/test/test_utf8_mode.py
> +++ b/Lib/test/test_utf8_mode.py
> @@ -208,7 +208,7 @@ class UTF8ModeTests(unittest.TestCase):
>      def test_cmd_line(self):
>          arg = 'h\xe9\u20ac'.encode('utf-8')
>          arg_utf8 = arg.decode('utf-8')
> -        arg_ascii = arg.decode('ascii', 'surrogateescape')
> +        arg_ascii = arg.decode('roman8', 'surrogateescape')
>          code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'
>
>          def check(utf8_opt, expected, **kw):
>
> and the output is:
> ======================================================================
> FAIL: test_cmd_line (test.test_utf8_mode.UTF8ModeTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 224, in test_cmd_line
>     check('utf8=0', [c_arg], LC_ALL='C')
>   File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 217, in check
>     self.assertEqual(args, ascii(expected), out)
> AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\xfb\\u02cb\\xe3\\x82\\u02dc']"
> - ['h\xc3\xa9\xe2\x82\xac']
> + ['h\xfb\u02cb\xe3\x82\u02dc']
>  : roman8:['h\xc3\xa9\xe2\x82\xac']
>
> I still don't understand that.
Something I found helpful was to change:

        check('utf8=0', [c_arg], LC_ALL='C')

to
        check('utf8=0', [c_arg], LC_ALL='C', failure=True )

This also fails, but it shows what is being executed.

Further, my 'understanding' is that ascii(whatever) is much smarter than
whatever.decode('ascii', ...) does. Also, ascii() tends to use the \x
shorthand, while decode('ascii', 'surrogateescape') uses the \udc prefix.

And, while you might still consider it a 'bug', did you try using c_arg
= arg.decode('iso-88859-1') ?

Michael (F)

>
> I believe that surrogate escape only works for ASCII and nothing else. If so, this test must be skipped on HP-UX and AIX.
>
> ----------
>
> _______________________________________
> Python tracker <report at bugs.python.org>
> <https://bugs.python.org/issue34403>
> _______________________________________
>

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue34403>
_______________________________________