[pypy-issue] Issue #2618: incorrect "surrogatepass" encoding with pypy3.5-5.8.0 (pypy/pypy)

Cosimo Lupo issues-reply at bitbucket.org
Wed Jul 26 13:12:46 EDT 2017


New issue 2618: incorrect "surrogatepass" encoding with pypy3.5-5.8.0
https://bitbucket.org/pypy/pypy/issues/2618/incorrect-surrogatepass-encoding-with

Cosimo Lupo:

Hello,

I'm getting different encodings between CPython 3.5.3 and pypy3.5-5.8.0 when the input string contains surrogate code points (here, a surrogate pair written as two separate \uXXXX escapes).
 
When I round-trip the string 'Carrot \ud83e\udd55' through the "utf_16_be" codec with errors="surrogatepass", CPython correctly gives me 'Carrot \U0001f955':

```
Python 3.5.3 (default, Jul 18 2017, 13:04:39)
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 'Carrot \ud83e\udd55'.encode('utf_16_be', errors='surrogatepass')
b'\x00C\x00a\x00r\x00r\x00o\x00t\x00 \xd8>\xddU'
>>> 'Carrot \ud83e\udd55'.encode('utf_16_be', errors='surrogatepass').decode('utf_16_be')
'Carrot \U0001f955'
```

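However, with PyPy3.5 5.8.0 and the same input and code, the two surrogates come out with their bytes swapped (little-endian order), so the round trip decodes to the wrong characters: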
```
Python 3.5.3 (a37ecfe5f142bc971a86d17305cc5d1d70abec64, Jul 25 2017, 16:48:07)
[PyPy 5.8.0-beta0 with GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
And now for something completely different: ``the future has just begun''
>>>> 'Carrot \ud83e\udd55'.encode('utf_16_be', errors='surrogatepass')
b'\x00C\x00a\x00r\x00r\x00o\x00t\x00 >\xd8U\xdd'
>>>> 'Carrot \ud83e\udd55'.encode('utf_16_be', errors='surrogatepass').decode('utf_16_be')
'Carrot 㻘嗝'
```
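For anyone who wants to check an interpreter without relying on the codec itself, here is a minimal sketch that builds the expected UTF-16-BE bytes by hand and compares them with what `str.encode` returns. The helper name `utf16_be_reference` is hypothetical (it is not part of CPython or PyPy), and the sketch assumes that lone surrogates should simply be written out as big-endian 16-bit units, which is what the CPython session above shows.

```python
import struct


def utf16_be_reference(text):
    """Hand-rolled UTF-16-BE encoder (hypothetical helper, for comparison only).

    Surrogate code points are written out as-is, which mirrors what
    errors='surrogatepass' is expected to do for this codec.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split a supplementary-plane code point into a surrogate pair.
            cp -= 0x10000
            out += struct.pack(">HH", 0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF))
        else:
            # BMP code points, including lone surrogates: one big-endian unit.
            out += struct.pack(">H", cp)
    return bytes(out)


s = 'Carrot \ud83e\udd55'
expected = utf16_be_reference(s)
actual = s.encode('utf_16_be', errors='surrogatepass')

# The same two surrogates packed little-endian, i.e. the byte order
# seen in the PyPy session above.
swapped_tail = struct.pack("<HH", 0xD83E, 0xDD55)

print(actual == expected)             # False would indicate the problem described here
print(actual.endswith(swapped_tail))  # True would mean the surrogates were byte-swapped
```

If the first check fails and the second succeeds, the interpreter is writing surrogates in little-endian order even though the big-endian codec was requested.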

I'm on macOS 10.12.6 and compiled pypy3 from source, using the latest GCC 7.1.0 from Homebrew.
I haven't had the chance to try this on Linux yet.

Thanks for your help.



