[Python-ideas] Create a StringBuilder class and use it everywhere

k_bx k.bx at ya.ru
Thu Aug 25 12:38:44 CEST 2011


25.08.2011, 12:28, "k_bx" <k.bx at ya.ru>:
> Hi!
>
> There's a certain problem in Python right now: when people need to build a string from pieces, they very often do something like this::
>
>     def main_pure():
>         b = u"initial value"
>         for i in xrange(30000):
>             b += u"more data"
>         return b
>
> The bad thing about it is that a new string is created every time you do +=, so it performs badly on CPython (and horribly on PyPy). If people used, for example, a list of strings, performance would be much better::
>
>     def main_list_append():
>         b = [u"initial value"]
>         for i in xrange(3000000):
>             b.append(u"more data")
>         return u"".join(b)
>
> The results are::
>
>     kost at kost-laptop:~/tmp$ time python string_bucket_pure.py
>
>     real 0m7.194s
>     user 0m3.590s
>     sys 0m3.580s
>     kost at kost-laptop:~/tmp$ time python string_bucket_append.py
>
>     real 0m0.417s
>     user 0m0.330s
>     sys 0m0.080s
>
> Fantastic, isn't it?
>
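The two approaches above can also be compared in-process instead of with `time`. This is only a minimal sketch, assuming Python 3 (`range` and plain `str` instead of `xrange` and `u""`); the names `build_concat` and `build_join` and the iteration count are made up for illustration. Timings will differ between interpreters and versions, and CPython can sometimes resize a string in place on `+=`, so no particular numbers should be expected.

```python
import timeit

PIECE = "more data"


def build_concat(n):
    # Repeated +=: each step may have to allocate a new, longer string.
    b = "initial value"
    for _ in range(n):
        b += PIECE
    return b


def build_join(n):
    # Collect the pieces in a list and join them once at the end.
    parts = ["initial value"]
    for _ in range(n):
        parts.append(PIECE)
    return "".join(parts)


if __name__ == "__main__":
    n = 30000  # hypothetical size, small enough to run quickly
    for fn in (build_concat, build_join):
        seconds = timeit.timeit(lambda: fn(n), number=10)
        print(fn.__name__, seconds)
```

Both functions return the same string, so only the running time differs.
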
> Also, let's forget about speed for a moment and think about semantics: your task is to "build a string from its pieces", or in other words "build a string from a list of pieces", so from this point of view using [] and u"".join is semantically better.
>
> Java has had its StringBuilder class for a long time (I'm not really into Java, I've just been told about it), and I think Python should have its own StringBuilder::
>
>     class StringBuilder(object):
>         """Use it instead of doing += for building unicode strings from pieces"""
>         def __init__(self, val=u""):
>             self.val = val
>             self.appended = []
>
>         def __iadd__(self, other):
>             self.appended.append(other)
>             return self
>
>         def __unicode__(self):
>             self.val = u"".join((self.val, u"".join(self.appended)))
>             self.appended = []
>             return self.val
>
> Why a StringBuilder class rather than just [] + u''.join? Well, I have two reasons for that:
>
> 1. It has caching: the joined result is stored in self.val, so pieces that were already joined are not joined again.
> 2. You can document it: when a programmer looks at the [] + u"".join idiom he doesn't see _WHY_ it is done that way, while when he sees a StringBuilder class he can go ahead and read its help().
>
> The performance of StringBuilder is OK compared to [] + u"".join (I've increased the number of += operations from 30000 to 30000000)::
>
>     def main_bucket():
>         b = StringBuilder(u"initial value ")
>         for i in xrange(30000000):
>             b += u"more data"
>         return unicode(b)
>
> For CPython::
>
>         kost at kost-laptop:~/tmp$ time python string_bucket_bucket.py
>
>         real 0m12.944s
>         user 0m11.670s
>         sys 0m1.260s
>
>         kost at kost-laptop:~/tmp$ time python string_bucket_append.py
>
>         real 0m3.540s
>         user 0m2.830s
>         sys 0m0.690s
>
> For PyPy 1.6::
>
>         (pypy)kost at kost-laptop:~/tmp$ time python string_bucket_bucket.py
>
>         real 0m18.593s
>         user 0m12.930s
>         sys 0m5.600s
>
>         (pypy)kost at kost-laptop:~/tmp$ time python string_bucket_append.py
>
>         real 0m16.214s
>         user 0m11.750s
>         sys 0m4.280s
>
> Of course, a C implementation could make things faster on CPython, I guess, but really, in comparison to the += approach it doesn't matter now. The point is to be explicit.
>
> P.S. Also, why not use cStringIO?
> 1. It's not semantically right to create a file-like object just to join multiple string pieces into one.
> 2. If you mean using it in your code right away: you can see that still no one uses it, because people want += (while StringBuilder gives them +=).
> 3. It's somewhat slow on PyPy right now :-)
>
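For comparison, the file-like approach mentioned in the P.S. can be sketched as follows. This assumes Python 3, where the `cStringIO` module is gone and `io.StringIO` is the in-memory text buffer; the helper name `build_with_stringio` is hypothetical.

```python
import io


def build_with_stringio(pieces):
    # Write each piece into an in-memory text buffer, then read it all back.
    buf = io.StringIO()
    for piece in pieces:
        buf.write(piece)
    return buf.getvalue()
```

It works, but as point 1 says, the reader sees `write()` calls on a file-like object rather than `+=` on something that looks like a string.
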
> Thanks.

Oh, and also, I really like that Python has had its MutableString class since forever, but it was deprecated (and removed in Python 3).


