[pypy-issue] Issue #2997: ''.join(somestring) is buggy in presence of non-ascii characters (pypy/pypy)

Fri Apr 12 13:00:15 EDT 2019

New issue 2997: ''.join(somestring) is buggy in presence of non-ascii characters
https://bitbucket.org/pypy/pypy/issues/2997/join-somestring-is-buggy-in-presence-of

Antonio Cuni:

The following snippet prints a weird result on the latest pypy3 (nightly):

```
#-*- encoding: utf-8 -*-

def dump(s):
    print("    len():", len(s))
    print("    repr():", repr(s))
    print("    chars:", [ord(ch) for ch in s])

x = "a = 'à'"
y = ''.join(x)
print("x == y: ", x == y)
print("x:")
dump(x)
print()
print("y: ")
dump(y)
```
```
$ ./pypy3 foo.py
x == y:  True
x:
    len(): 7
    repr(): "a = 'à'"
    chars: [97, 32, 61, 32, 39, 224, 39]

y: 
    len(): 8
    repr(): "a = 'à'"
    chars: [97, 32, 61, 32, 39, 224, 39, 208]
```

Note that `x == y` is `True` even though the two strings differ in length, and that `y` has an extra char (208) which is not printed by `repr()`. The value 208 seems to be non-deterministic, so I suppose it is caused by an off-by-one error which makes something read past the end of the string.
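One observation that may help narrow this down, offered as a guess rather than a diagnosis: the reported `len(y)` of 8 happens to equal the number of bytes in the UTF-8 encoding of `x`, while the correct `len(x)` of 7 is its number of code points. The sketch below only shows the two counts for the string from the report; it does not establish that the interpreter confuses them.

```python
# Same string as in the report; the escape \u00e0 is 'à', written this way
# so the snippet does not depend on the source file encoding.
x = "a = '\u00e0'"

code_points = len(x)                 # number of Unicode code points
utf8_bytes = len(x.encode("utf-8"))  # number of bytes once encoded as UTF-8

print("code points:", code_points)
print("utf-8 bytes:", utf8_bytes)
```

If the two numbers match the `len()` values printed for `x` and `y` above, a byte count being used where a code point count is expected would be one candidate explanation for the extra char.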

This is the ultimate cause of the `\u0000` reported in issue #2983.
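The symptoms above can be turned into a small check. This is a minimal sketch, not an existing pypy test: `check_join_roundtrip` is a hypothetical helper name, and the sample strings are arbitrary. It tests the three properties that the output above violates or makes suspicious (length, code points, equality) independently of each other, since `x == y` alone reports `True` in the buggy run.

```python
def check_join_roundtrip(s):
    """Hypothetical helper: return a list of problems found when
    comparing ''.join(s) with s; an empty list means no problem."""
    joined = ''.join(s)
    problems = []
    # Length must be preserved: joining the chars of s adds nothing.
    if len(joined) != len(s):
        problems.append("len %d != %d" % (len(joined), len(s)))
    # Compare code points one by one, not via ==, which the report
    # shows can return True even when the contents differ.
    if [ord(ch) for ch in joined] != [ord(ch) for ch in s]:
        problems.append("code points differ")
    if joined != s:
        problems.append("== is False")
    return problems

# Arbitrary samples: the string from the report, plain ASCII,
# only non-ASCII chars, and the empty string.
samples = ["a = '\u00e0'", "plain ascii", "\u00e0\u00e8\u00ec", ""]
for sample in samples:
    print(repr(sample), check_join_roundtrip(sample) or "ok")
```

On an interpreter without the bug, every sample should report no problems; on the affected nightly, the first sample would be expected to report a length mismatch.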