[Python-Dev] RFC: Add a new builtin strarray type to Python?

Victor Stinner victor.stinner at haypocalc.com
Sat Oct 1 19:17:56 CEST 2011


Hi,

Since the integration of PEP 393, str += str is no longer super-fast (just 
fast). For example, appending a single character to a string has to copy all 
the characters to a new string. I suppose that the performance of many 
applications manipulating text may be affected by this issue, especially text 
templating libraries.

io.StringIO has also been changed to store characters as Py_UCS4 (4 bytes) 
instead of Py_UNICODE (2 or 4 bytes). This class doesn't benefit from the 
compact string representation introduced by PEP 393.

I propose adding a new builtin type to Python to address both issues (CPU and 
memory): *strarray*. This type would have the same API as str, except:

 * has append() and extend() methods
 * method results are strarray instead of str

I'm writing this email to ask whether this type solves a real issue, or 
whether we can just promote the super-fast str.join(list of str).
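
For reference, the idioms being compared can be sketched as follows. This is 
only an illustration: the function names are made up for this example, and no 
timings are claimed here.

```python
import io


def build_with_concat(parts):
    """Accumulate with +=; each step may have to copy the text built so far."""
    result = ""
    for part in parts:
        result += part
    return result


def build_with_join(parts):
    """Collect the pieces first, then build the final string in one call."""
    return "".join(parts)


def build_with_stringio(parts):
    """Write the pieces into an in-memory text buffer."""
    buffer = io.StringIO()
    for part in parts:
        buffer.write(part)
    return buffer.getvalue()


parts = [str(i) for i in range(1000)]
# All three approaches produce the same string; only the cost differs.
assert build_with_concat(parts) == build_with_join(parts)
assert build_with_join(parts) == build_with_stringio(parts)
```

The timeit module can be used to compare the three functions on a given 
Python build.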

--

strarray is similar to bytearray, but different: strarray('abc')[0] is 'a', not 
97, and strarray can store any Unicode character (not only integers in range 
0-255).

I wrote a quick and dirty implementation in Python just to be able to play 
with the API, and to have an idea of the quantity of work required to 
implement it:

https://bitbucket.org/haypo/misc/src/tip/python/strarray.py

(Some methods are untested: see the included TODO list.)
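
The linked file is not reproduced here. As a rough illustration of the 
semantics described above, here is a hypothetical sketch: the class name 
StrArraySketch and its list-based storage are assumptions made for this 
example, not the linked implementation.

```python
class StrArraySketch:
    """Hypothetical mutable text buffer, backed by a list of 1-char strings."""

    def __init__(self, text=""):
        self._chars = list(text)  # one entry per code point

    def append(self, char):
        # append() takes exactly one character, like list.append takes one item
        if len(char) != 1:
            raise ValueError("append() expects a single character")
        self._chars.append(char)

    def extend(self, text):
        # extend() takes any iterable of characters, typically a str
        self._chars.extend(text)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return StrArraySketch(self._chars[index])
        return self._chars[index]  # a str of length 1, not an int

    def __len__(self):
        return len(self._chars)

    def __str__(self):
        return "".join(self._chars)


buf = StrArraySketch("abc")
buf.append("d")
buf.extend("\u20ac\U0001F600")  # characters outside the 0-255 range

assert buf[0] == "a"               # strarray-style indexing gives a str
assert bytearray(b"abc")[0] == 97  # bytearray indexing gives an int
assert str(buf) == "abcd\u20ac\U0001F600"
```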

--

Implementing strarray in C is not trivial, and it would be easier to do it 
in 3 steps:

 (a) Use Py_UCS4 array
 (b) Make the array type depend on the content, for the best memory 
footprint, as in PEP 393
 (c) Use strarray to implement a new io.StringIO

Or we can just stop after step (a).
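
To make step (b) concrete: PEP 393 stores 1, 2 or 4 bytes per character 
depending on the largest code point in the string. A content-dependent array 
could pick its item size the same way. The helper below is a hypothetical 
sketch of that choice (the name storage_width is invented for this example); 
it says nothing about how the C code would be structured.

```python
def storage_width(text):
    """Return the bytes per character needed for text: 1, 2 or 4."""
    max_code_point = max(map(ord, text), default=0)
    if max_code_point < 0x100:
        return 1  # fits in Latin-1
    if max_code_point < 0x10000:
        return 2  # fits in UCS-2
    return 4      # needs UCS-4


assert storage_width("abc") == 1
assert storage_width("\u20ac") == 2       # EURO SIGN
assert storage_width("\U0001F600") == 4   # outside the BMP
```

With step (a) only, every character would cost 4 bytes regardless of the 
content.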

--

The strarray API has to be discussed.

Most bytearray methods return a new object in most cases. I don't understand 
why; it's not efficient. I don't know whether we can do in-place operations 
for strarray methods that have the same name as bytearray methods (which are 
not in-place).
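
The current bytearray behaviour can be checked with a short example that uses 
only the existing builtin type:

```python
data = bytearray(b"abc")

# str-like methods such as upper() and replace() return a new bytearray
upper = data.upper()
assert upper is not data
assert upper == bytearray(b"ABC")
assert data == bytearray(b"abc")  # the original is unchanged

replaced = data.replace(b"a", b"x")
assert replaced is not data

# Only the sequence-style operations modify the object in place
alias = data
data += b"de"
data.extend(b"f")
data.append(ord("g"))  # append() takes an int, not a character
assert alias is data
assert data == bytearray(b"abcdefg")
```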

str has some methods that bytes and bytearray don't have, like format. We 
may do in-place operations for these methods.

Victor

