[issue1943] improved allocation of PyUnicode objects

Mon Feb 1 11:39:19 CET 2010

Marc-Andre Lemburg <mal at egenix.com> added the comment:

Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou at free.fr> added the comment:
> 
>> I find that the null termination for 8-bit strings makes low-level
>> parsing operations (e.g., parsing a numeric string) safer and easier:
> 
> Not to mention faster. The new IO library makes use of it (for newline
> detection), on both bytestrings and unicode strings.

I'd consider that a bug. The IO lib in particular should be 8-bit
clean, in the sense that it doesn't attach any special meaning to NUL
characters or code points.

Besides, using a for-loop with a counter is both safer and faster
than checking each and every character for NUL.
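
To illustrate, here is a minimal sketch of the two styles applied to
newline detection. The function names are made up for this example and
are not taken from the IO library; the sketch only shows the difference
in behaviour when the data contains an embedded NUL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the index of the first '\n' in
   buf[0..len), or -1 if there is none. The loop is bounded by len
   only, so NUL bytes are treated as ordinary data. */
static ptrdiff_t find_newline_counted(const char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (buf[i] == '\n')
            return (ptrdiff_t)i;
    }
    return -1;
}

/* Hypothetical helper: the same search, but relying on a NUL
   terminator. It stops at the first NUL it sees, whether that is the
   real end of the buffer or just a NUL inside the data. */
static ptrdiff_t find_newline_nul(const char *buf)
{
    const char *p = buf;

    assert(buf != NULL);
    while (*p != '\0') {
        if (*p == '\n')
            return p - buf;
        p++;
    }
    return -1;
}
```

For the five bytes 'a', NUL, 'b', '\n', 'c', the counted version finds
the newline at index 3, while the NUL-based version stops at index 1
and reports no newline at all.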

Just think of what can happen if buggy code overwrites the NUL byte
in some corner case and other code then relies on that NUL byte
as terminator - a classical buffer overrun.

If you're lucky, you get a segfault. If not, you end up with
data corruption or manipulation of data, which could lead to
unwanted code execution.

The Python Unicode API deliberately tries to always use the combination
of a Py_UNICODE* pointer and a length integer to avoid such issues.
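
As a rough sketch of that pointer-plus-length style, here is how
parsing a numeric string could look. Note the assumptions: unichar is
a stand-in typedef so that the example compiles without Python.h, and
parse_decimal() is a hypothetical helper, not a function of the Python
Unicode API.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumption: stand-in for Py_UNICODE, used only in this sketch. */
typedef unsigned int unichar;

/* Hypothetical helper: parse s[0..len) as an unsigned decimal number.
   Returns 0 and stores the value in *out on success, -1 on an empty
   input, a non-digit code point (including NUL), or overflow.
   The buffer does not need a terminator; only len bounds the loop. */
static int parse_decimal(const unichar *s, size_t len, unsigned long *out)
{
    unsigned long value = 0;
    size_t i;

    assert(s != NULL || len == 0);
    assert(out != NULL);
    if (len == 0)
        return -1;
    for (i = 0; i < len; i++) {
        unsigned long digit;

        if (s[i] < '0' || s[i] > '9')
            return -1;
        digit = (unsigned long)(s[i] - '0');
        /* Reject the digit if value * 10 + digit would overflow. */
        if (value > (ULONG_MAX - digit) / 10)
            return -1;
        value = value * 10 + digit;
    }
    *out = value;
    return 0;
}
```

Since the caller passes the length explicitly, the function never
reads past the end of the buffer, even if no terminator was ever
written or it got overwritten later on.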

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue1943>
_______________________________________