[Python-Dev] RFC: Add a new builtin strarray type to Python?
Victor Stinner
victor.stinner at haypocalc.com
Sat Oct 1 19:17:56 CEST 2011
Hi,
Since the integration of PEP 393, str += str is no longer super-fast (just
fast). For example, appending a single character to a string has to copy all
characters to a new string. I suppose that the performance of a lot of
applications manipulating text may be affected by this issue, especially text
templating libraries.
io.StringIO has also been changed to store characters as Py_UCS4 (4 bytes)
instead of Py_UNICODE (2 or 4 bytes). This class doesn't benefit from
PEP 393.
I propose to add a new builtin type to Python to address both issues (CPU and
memory): *strarray*. This type would have the same API as str, except:
* has append() and extend() methods
* method results are strarray instead of str
I'm writing this email to ask you whether this type solves a real issue, or
whether we can just promote the super-fast str.join(list of str).
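
For reference, here is a minimal sketch of the two existing ways to build a
string piece by piece. The function names are made up for illustration; only
str += str and str.join come from the discussion above.

```python
def build_with_concat(pieces):
    # Repeated str += str: each step may have to copy the text built so far.
    result = ""
    for piece in pieces:
        result += piece
    return result


def build_with_join(pieces):
    # Collect the pieces in a list, then build the final string in one pass.
    parts = []
    for piece in pieces:
        parts.append(piece)
    return "".join(parts)


pieces = ["<li>%d</li>" % i for i in range(5)]
# Both approaches produce the same text; they differ only in how much
# copying happens along the way.
assert build_with_concat(pieces) == build_with_join(pieces)
```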
--
strarray is similar to bytearray, but different: strarray('abc')[0] is 'a', not
97, and strarray can store any Unicode character (not only integers in range
0-255).
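
To make the difference with bytearray concrete, here is a tiny stand-in
written for this email only. It is not the code from the bitbucket link, the
class name StrArraySketch is hypothetical, and the exact behaviour of append()
and extend() (single character versus any iterable of characters) is an
assumption.

```python
class StrArraySketch:
    """Hypothetical, minimal stand-in for the proposed strarray type."""

    def __init__(self, text=""):
        # One list item per character; a C version would use a real array.
        self._chars = list(text)

    def __len__(self):
        return len(self._chars)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slices stay in the same type, as str methods would.
            return StrArraySketch("".join(self._chars[index]))
        # A single item is a 1-character str, not an integer.
        return self._chars[index]

    def append(self, char):
        # Assumption: append() takes exactly one character.
        if len(char) != 1:
            raise ValueError("append() expects a single character")
        self._chars.append(char)

    def extend(self, text):
        # Assumption: extend() takes a str (or any iterable of characters).
        self._chars.extend(text)

    def __str__(self):
        return "".join(self._chars)


sa = StrArraySketch("abc")
sa.append("d")
sa.extend("\u20ac\U0001F600")   # characters outside the 0-255 range
print(sa[0])                    # a
print(bytearray(b"abc")[0])     # 97
```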
I wrote a quick and dirty implementation in Python just to be able to play
with the API, and to get an idea of the amount of work required to
implement it:
https://bitbucket.org/haypo/misc/src/tip/python/strarray.py
(Some methods are untested: see the included TODO list.)
--
Implementing strarray in C is not trivial, and it would be easier to do it
in 3 steps:
(a) Use Py_UCS4 array
(b) The array type depends on the content: best memory footprint, as in PEP
393
(c) Use strarray to implement a new io.StringIO
Or we can just stop after step (a).
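
As a rough illustration of step (b): PEP 393 stores a str with 1, 2 or 4 bytes
per character depending on its highest code point, and strarray could pick its
array type the same way. The helper below is only a sketch and its name is
hypothetical. Step (a) would correspond to always using 4.

```python
def min_char_width(text):
    """Return the bytes per character needed to store text: 1, 2 or 4."""
    # The highest code point decides the width for the whole array.
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1    # fits in Latin-1
    if highest < 0x10000:
        return 2    # fits in the Basic Multilingual Plane
    return 4        # needs the full Py_UCS4 range


for sample in ("abc", "caf\u00e9", "\u20ac", "\U0001F600"):
    print(ascii(sample), min_char_width(sample))
```

With such a scheme, appending a single non-BMP character to an array of ASCII
text would force a conversion of the whole array to the wider type.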
--
The strarray API has to be discussed.
In most cases, bytearray methods return a new object. I don't understand
why: it's not efficient. I don't know if we can do in-place operations for
strarray methods having the same name as bytearray methods (which are not
in-place methods).
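
A short example of the current bytearray behaviour: methods shared with bytes,
such as upper(), return a new object, whereas extend() and += modify the
object in place.

```python
data = bytearray(b"abc")

upper = data.upper()        # str-like method: returns a new bytearray
assert upper is not data
assert upper == b"ABC"
assert data == b"abc"       # the original object is unchanged

same = data
data.extend(b"de")          # in-place method
data += b"f"                # += is also in place for bytearray
assert same is data
assert data == b"abcdef"
```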
str has some methods that bytes and bytearray don't have, like format. We
could do in-place operations for these methods.
Victor