Why don't strings share data in Python?

Tue Apr 16 17:27:23 EDT 2002

Joe wrote:
> But, yeah, it'd be a little complicated.

Yep. I actually started to write a PEP about lazy slicing but decided
not to finish because it would very likely be rejected due to the
complexity of the implementation. Actually, I did get part way through
an implementation and it involved touching a lot of code.

PEP: XXX
Title: Lazy Slices
Version: $Revision: 1.0 $
Last-Modified: $Date: 2002/2/28 10:00:00 $
Author: brian at sweetapp.com (Brian Quinlan)
Status: Draft
Type: Standards Track
Created: 28-Feb-2002
Python-Version: 2.3
Post-History:

Abstract

    This PEP describes an approach which would improve the CPU and
    memory efficiency of generating large string and Unicode objects
    from slices.

Rationale

    Slicing is a powerful and commonly used Python idiom. However, it
    is not always efficient to use because the time needed to 
    generate a slice is proportional to the length of the generated
    object. This can force the programmer to use more complicated
    techiques to avoid the expensive duplication of the existing
    sequence data.

    Low-level languages use address arithmatic to manipulate sub-
    sequences in constant time e.g. 

        big_str = "Names:\nGuido\nTim\n..."
        if str[6:].find("Ame") != -1): ...

        Could be expressed as follows in C:

        char * big_str = "Names:\nGuido\nTim\n...";
        if (strstr(str + 6, "Ame") != 0) ...

    The Python implementation for string and Unicode objects should
    offer the same constant time efficiency when generating slices.

Implementation

    The most obvious implementation would involve changing the
    PyStringObject and PyUnicodeObject definitions to include a union
    similar to the following:    

        struct t_string {
            PyObject *ob_sinterned;
            int ob_size;
            char ob_sval[1];
        };

        struct t_slice {
            PyObject *ob_sparent;
            /* start index of the slice */
            int ob_start;
        };

        typedef struct {
            PyObject_HEAD
            long ob_shash;
            /* -1 indicates t_string, otherwise it indicates the 
                ending index of the slice */
            int ob_end;

            union {
                struct t_string;
                struct t_slice;
            };
        } PyStringObject;

    Whenever a slice is taken of a string of Unicode object, the
    new object would use the t_slice portion of the union and would
    retain a reference to the original object.

    The C and Python interfaces to the string and Unicode objects
    would remain unchanged.

    Think about these optimizations:

    - don't use lazy slices if the resulting object with be "small"
    (relative to the original or in absolute terms)
    - there could be a built-in function that changes the slice inplace 
    to become a first class string (ala intern)
    - if there is only one reference to a string, and it is a "small"
slice
    then maybe we should reclaim the memory

Copyright

    This document has been placed in the public domain.