[pypy-issue] Issue #2997: ''.join(somestring) is buggy in presence of non-ascii characters (pypy/pypy)

Antonio Cuni issues-reply at bitbucket.org
Fri Apr 12 13:00:15 EDT 2019

New issue 2997: ''.join(somestring) is buggy in presence of non-ascii characters

Antonio Cuni:

The following snippet prints a weird result on the latest pypy3 (nightly):

#-*- encoding: utf-8 -*-

def dump(s):
    print("    len():", len(s))
    print("    repr():", repr(s))
    print("    chars:", [ord(ch) for ch in s])

x = "a = 'à'"
y = ''.join(x)
print("x == y: ", x == y)
print("x: ")
dump(x)
print()
print("y: ")
dump(y)

$ ./pypy3 foo.py
x == y:  True
x:
    len(): 7
    repr(): "a = 'à'"
    chars: [97, 32, 61, 32, 39, 224, 39]

y:
    len(): 8
    repr(): "a = 'à'"
    chars: [97, 32, 61, 32, 39, 224, 39, 208]

Note that `x == y` is True even though the two strings differ in length, and that y has an extra char (208) which repr() does not print. The value 208 seems to be non-deterministic, so I suppose it comes from an off-by-one error that makes some code read past the end of the string.
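
For reference, the invariants that the output above violates can be written down as a small check. This is only a sketch: `check_join_roundtrip` is a hypothetical helper name, not anything in PyPy, and the last two lines record an observation rather than a diagnosis. The reported length of y (8) happens to equal the UTF-8 byte length of x, which may or may not be related to the actual bug.

```python
# Hypothetical helper, not part of PyPy: for any str s, ''.join(s) should
# rebuild s exactly, so all three checks are expected to be True.
def check_join_roundtrip(s):
    joined = ''.join(s)
    return {
        "equal": joined == s,
        "same_length": len(joined) == len(s),
        "same_code_points": [ord(ch) for ch in joined] == [ord(ch) for ch in s],
    }

# Same string as in the report; \u00e0 is the precomposed 'à' (code point 224).
x = "a = '\u00e0'"
result = check_join_roundtrip(x)
print(result)

# Observation only (the link to the bug is an assumption): the number of code
# points and the number of UTF-8 bytes differ by one for this string.
code_points = len(x)
utf8_bytes = len(x.encode("utf-8"))
print(code_points, utf8_bytes)
```

On an interpreter without the bug, every entry in `result` is True; on the affected nightly, the report above suggests that at least the length and code point checks would fail.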

This is the ultimate cause of the `\u0000` reported in issue #2983.
