[Python-Dev] Disabling string interning for null and single-char causes segfaults

Stefan Bucur stefan.bucur at gmail.com
Sat Mar 2 21:55:07 CET 2013


On Sat, Mar 2, 2013 at 4:08 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On Sat, Mar 2, 2013 at 1:24 AM, Stefan Bucur <stefan.bucur at gmail.com> wrote:
>> Hi,
>>
>> I'm working on an automated bug finding tool that I'm trying to apply on the
>> Python interpreter code (version 2.7.3). Because of early prototype
>> limitations, I needed to disable string interning in stringobject.c. More
>> precisely, I modified the PyString_FromStringAndSize and PyString_FromString
>> to no longer check for the null and single-char cases, and create instead a
>> new string every time (I can send the patch if needed).
>>
>> However, after applying this modification, when running "make test" I get a
>> segfault in the test___all__ test case.
>>
>> Before digging deeper into the issue, I wanted to ask here if there are any
>> implicit assumptions about string identity and interning throughout the
>> interpreter implementation. For instance, are two single-char strings having
>> the same content supposed to be identical objects?
>>
>> I'm assuming that it's either this, or some refcount bug in the interpreter
>> that manifests only when certain strings are no longer interned and thus
>> have a higher chance to get low refcount values.
>
> In theory, interning is supposed to be a pure optimisation, but it
> wouldn't surprise me if there are cases that assume the described
> strings are always interned (especially the null string case). Our
> test suite would never detect such bugs, as we never disable the
> interning.

I understand. In that case, I'll investigate the issue further and
see what exactly is causing the crash.
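
For context, the fast path I removed looks roughly like this (a
simplified, from-memory sketch of the checks at the top of
PyString_FromStringAndSize in Objects/stringobject.c; the
COUNT_ALLOCS bookkeeping is omitted, so the exact code may differ
slightly):

    /* nullstring and characters[] are the module-level caches of the
       interned empty string and single-character strings. */
    if (size == 0 && (op = nullstring) != NULL) {
        Py_INCREF(op);
        return (PyObject *)op;
    }
    if (size == 1 && str != NULL &&
        (op = characters[*str & UCHAR_MAX]) != NULL) {
        Py_INCREF(op);
        return (PyObject *)op;
    }
    /* ...otherwise a fresh PyStringObject is allocated below... */

My patch simply skips these two checks, so every call allocates a new
object.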

>
> Whether or not we're interested in fixing such bugs would depend on
> the size of the patches needed to address them. From our point of
> view, such bugs are purely theoretical (as the assumption is always
> valid in an unpatched CPython build), so if the problem is too hard to
> diagnose or fix, we're more likely to declare that interning of at
> least those kinds of string values is required for correctness when
> creating modified versions of CPython.
>
> I'm not sure what kind of analyser you are writing, but if it relates
> to the CPython C API, you may be interested in
> https://gcc-python-plugin.readthedocs.org/en/latest/cpychecker.html

That's quite a neat tool; I didn't know about it! It would probably
have saved me many hours of debugging obscure refcount bugs in my own
Python extensions :)

In any case, my analysis tool aims to find bugs in Python programs,
not in the CPython implementation itself. It works by performing
symbolic execution [1] on the Python interpreter while it is executing
the target Python program. This means that the Python interpreter's
memory space contains symbolic expressions (i.e., mathematical
formulas over the program's input) instead of "concrete" values.
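
As a toy illustration (in C for brevity; my tool targets Python
programs, but the idea is the same): if one input byte is treated as
a symbolic value, a branch on it forks the analysis into two paths,
each carrying a constraint on the input.

    /* 'c' is a symbolic input byte; the branch forks the analysis
       into two paths with different path conditions. */
    int classify(char c)
    {
        if (c == 'a')   /* path 1: constraint  c == 'a' */
            return 1;
        return 0;       /* path 2: constraint  c != 'a' */
    }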

Interned strings are pesky for symbolic execution because the
PyObject* pointer obtained when creating a string depends on the
string contents: if the contents are already interned, the old pointer
is returned; otherwise a new object is created. So the pointer itself
becomes "symbolic", i.e., dependent on the input data, which makes the
analysis much more complicated.
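
To make that concrete, here is a small embedding example (a sketch
against the CPython 2.7 C API): with stock interning both calls below
return the very same object, while with my patch applied they return
two distinct allocations.

    #include <Python.h>

    int main(void)
    {
        Py_Initialize();
        /* In unpatched CPython 2.7, single-character strings are
           shared, so both calls yield the same pointer; with
           interning disabled, two distinct objects are created. */
        PyObject *a = PyString_FromStringAndSize("x", 1);
        PyObject *b = PyString_FromStringAndSize("x", 1);
        printf("same object: %s\n", a == b ? "yes" : "no");
        Py_DECREF(a);
        Py_DECREF(b);
        Py_Finalize();
        return 0;
    }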

Stefan

[1] http://en.wikipedia.org/wiki/Symbolic_execution

