[Python-ideas] Create a StringBuilder class and use it everywhere

M.-A. Lemburg mal at egenix.com
Thu Aug 25 11:45:55 CEST 2011


k_bx wrote:
> Hi!
> 
> There's a common problem in Python right now: when people need to build a string from pieces, they very often do something like this::
> 
>     def main_pure():
>         b = u"initial value"
>         for i in xrange(30000):
>             b += u"more data"
>         return b
> 
> The bad thing about it is that a new string is created every time you do +=, so it performs badly on CPython (and horribly on PyPy). If people used, for example, a list of strings, performance would be much better::
> 
>     def main_list_append():
>         b = [u"initial value"]
>         for i in xrange(3000000):
>             b.append(u"more data")
>         return u"".join(b)
> 
> The results are::
> 
>     kost@kost-laptop:~/tmp$ time python string_bucket_pure.py
> 
>     real	0m7.194s
>     user	0m3.590s
>     sys	0m3.580s
>     kost@kost-laptop:~/tmp$ time python string_bucket_append.py
> 
>     real	0m0.417s
>     user	0m0.330s
>     sys	0m0.080s
> 
> Fantastic, isn't it?
> 
> Also, let's forget about speed for a moment and think about semantics: the task is to "build a string from its pieces", or in other words "build a string from a list of pieces", so from this point of view using [] and u"".join is also the better fit semantically.
> 
> Java has had its StringBuilder class for a long time (I'm not really into Java, I've just been told about it), and I think Python should have its own StringBuilder::
> 
>     class StringBuilder(object):
>         """Use it instead of doing += for building unicode strings from pieces"""
>         def __init__(self, val=u""):
>             self.val = val
>             self.appended = []
> 
>         def __iadd__(self, other):
>             self.appended.append(other)
>             return self
> 
>         def __unicode__(self):
>             self.val = u"".join((self.val, u"".join(self.appended)))
>             self.appended = []
>             return self.val
> 
> Why a StringBuilder class rather than just [] + u''.join? Well, I have two reasons for that:
> 
> 1. It has caching
> 2. You can document it: when a programmer looks at the [] + u"" idiom he doesn't see _WHY_ it is done that way, while when he sees a StringBuilder class he can go ahead and read its help().
> 
> Performance of StringBuilder is OK compared to [] + u"" (I've increased the number of += operations from 30000 to 30000000)::
> 
>     def main_bucket():
>         b = StringBuilder(u"initial value ")
>         for i in xrange(30000000):
>             b += u"more data"
>         return unicode(b)
> 
> For CPython::
> 
>     kost@kost-laptop:~/tmp$ time python string_bucket_bucket.py
> 
>     real	0m12.944s
>     user	0m11.670s
>     sys	0m1.260s
> 
>     kost@kost-laptop:~/tmp$ time python string_bucket_append.py
> 
>     real	0m3.540s
>     user	0m2.830s
>     sys	0m0.690s
> 
> For PyPy 1.6::
> 
>     (pypy)kost@kost-laptop:~/tmp$ time python string_bucket_bucket.py
> 
>     real	0m18.593s
>     user	0m12.930s
>     sys	0m5.600s
> 
>     (pypy)kost@kost-laptop:~/tmp$ time python string_bucket_append.py
> 
>     real	0m16.214s
>     user	0m11.750s
>     sys	0m4.280s
> 
> Of course, a C implementation could make things faster on CPython, I guess, but really, in comparison to the += method it doesn't matter now. The point is to be explicit.
> 
> P.S.: Also, why not use cStringIO?
> 1. It's not semantically right to create a file-like object just to join multiple string pieces into one.
> 2. If you're talking about using it in your code right away: you can see that no one uses it yet, because people want += (while StringBuilder gives them +=).
> 3. It's somewhat slow on PyPy right now :-)

I think you should use cStringIO in your class implementation.
The list + join idiom is nice, but it has the disadvantage of
creating and keeping alive many small string objects (with all
the memory overhead and fragmentation that goes along with it).

AFAIR, the most efficient approach is using arrays:

>>> import array
>>> t = array.array('u')
>>> t.extend(u'äöü')
>>> t
array('u', u'\xe4\xf6\xfc')
>>> t.tounicode()
u'\xe4\xf6\xfc'
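
The same idea as a small helper, sketched for Python 3. The function
name build is hypothetical. One assumption to note: the 'u' typecode
shown above is deprecated in Python 3, and 'w' is only available from
Python 3.13 on, so the sketch picks whichever the running interpreter
offers:

```python
import array

# Assumption: prefer 'w' (available on Python 3.13+) and fall back to
# the deprecated 'u' typecode used in the session above.
TYPECODE = "w" if "w" in array.typecodes else "u"


def build(pieces):
    """Join text pieces by appending them to one character array."""
    buf = array.array(TYPECODE)
    for piece in pieces:
        # fromunicode() extends the array with the characters of piece.
        buf.fromunicode(piece)
    # tounicode() converts the whole array back into a single string.
    return buf.tounicode()


text = build(["initial value "] + ["more data"] * 3)
```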

-- 
Marc-Andre Lemburg
eGenix.com
