Why don't strings share data in Python?
Brian Quinlan
brian at sweetapp.com
Tue Apr 16 17:27:23 EDT 2002
Joe wrote:
> But, yeah, it'd be a little complicated.
Yep. I actually started to write a PEP about lazy slicing but decided
not to finish because it would very likely be rejected due to the
complexity of the implementation. Actually, I did get part way through
an implementation and it involved touching a lot of code.
PEP: XXX
Title: Lazy Slices
Version: $Revision: 1.0 $
Last-Modified: $Date: 2002/2/28 10:00:00 $
Author: brian at sweetapp.com (Brian Quinlan)
Status: Draft
Type: Standards Track
Created: 28-Feb-2002
Python-Version: 2.3
Post-History:
Abstract
This PEP describes an approach which would improve the CPU and
memory efficiency of generating large string and Unicode objects
from slices.
Rationale
Slicing is a powerful and commonly used Python idiom. However, it
is not always efficient to use because the time needed to
generate a slice is proportional to the length of the generated
object. This can force the programmer to use more complicated
techiques to avoid the expensive duplication of the existing
sequence data.
Low-level languages use address arithmatic to manipulate sub-
sequences in constant time e.g.
big_str = "Names:\nGuido\nTim\n..."
if str[6:].find("Ame") != -1): ...
Could be expressed as follows in C:
char * big_str = "Names:\nGuido\nTim\n...";
if (strstr(str + 6, "Ame") != 0) ...
The Python implementation for string and Unicode objects should
offer the same constant time efficiency when generating slices.
Implementation
The most obvious implementation would involve changing the
PyStringObject and PyUnicodeObject definitions to include a union
similar to the following:
struct t_string {
PyObject *ob_sinterned;
int ob_size;
char ob_sval[1];
};
struct t_slice {
PyObject *ob_sparent;
/* start index of the slice */
int ob_start;
};
typedef struct {
PyObject_HEAD
long ob_shash;
/* -1 indicates t_string, otherwise it indicates the
ending index of the slice */
int ob_end;
union {
struct t_string;
struct t_slice;
};
} PyStringObject;
Whenever a slice is taken of a string of Unicode object, the
new object would use the t_slice portion of the union and would
retain a reference to the original object.
The C and Python interfaces to the string and Unicode objects
would remain unchanged.
Think about these optimizations:
- don't use lazy slices if the resulting object with be "small"
(relative to the original or in absolute terms)
- there could be a built-in function that changes the slice inplace
to become a first class string (ala intern)
- if there is only one reference to a string, and it is a "small"
slice
then maybe we should reclaim the memory
Copyright
This document has been placed in the public domain.
More information about the Python-list
mailing list