[Python-Dev] RFD: how to build strings from lots of slices?

Tim Peters tim_one@email.msn.com
Sun, 27 Feb 2000 18:19:52 -0500


[/F, upon the reinvention of substring descriptors]
> ...
> a) bad memory behaviour if you slice small strings out
> of huge input strings -- which may surprise newbies.

Experts too.  Dragon has gobs of code that copies little strings via loops
in Java and C++, because Java's and MFC's descriptor-based string classes
routinely keep a megabyte string alive after you've sliced out the 3 bytes
<0.5 wink> you needed.  Last year my group finally wrote its own string
classes, to just copy the damn things.  Performance improvement was
significant (both space & time).
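
A minimal sketch of the effect in present-day Python, using a memoryview
slice as a stand-in for a descriptor-based substring.  The names Buffer and
slice_keeps_parent_alive are made up for illustration, and the result relies
on CPython's reference counting; this is not how Java or MFC strings are
implemented.

```python
import gc
import weakref


class Buffer(bytearray):
    """Plain bytearray subclass; subclassing makes it weak-referenceable."""


def slice_keeps_parent_alive():
    big = Buffer(b"x" * 1_000_000)
    parent = weakref.ref(big)       # lets us observe when the megabyte dies

    view = memoryview(big)[:3]      # shares storage with big, like a descriptor
    del big
    gc.collect()
    alive_with_view = parent() is not None    # the 3-byte view pins the megabyte

    copied = bytes(view)            # copy the 3 bytes out
    del view
    gc.collect()
    alive_after_copy = parent() is not None   # only the copy remains
    return copied, alive_with_view, alive_after_copy
```

Under CPython the call returns the 3 copied bytes, then True, then False:
the big buffer stays alive for as long as the view exists and is freed once
only the copy is left.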

Boehm's "cords"/"ropes" (he's the primary author of both pkgs JC mentioned)
were specifically designed to support efficient random & repeated editing of
giant mutable strings -- I agree with Guido that they're an overall major
loss for pedestrian uses.  Heck, why not implement strings as giant B-trees
like the Tcl text widget does <wink>?

> b) harder to interface to underlying C libraries -- the
> current string implementation guarantees that a Python
> string is also a C string (with a trailing null).

c) For apps that use oodles of short strings, the space overhead of
maintaining descriptors exceeds that of making copies.  A buddy in Sun's
Java development group tells me Java is despised for this by Major Players
in the DB world; so don't be surprised if Java eventually drops the
descriptor idea too (or, more Java-like, introduces 5 new flavors of strings
<0.7 wink>).
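
A rough way to see the same trade-off in present-day CPython, again with a
memoryview slice standing in for a descriptor.  The helper name
descriptor_vs_copy_sizes is hypothetical, and the sizes reported by
sys.getsizeof are CPython implementation details, so treat the comparison
as illustrative rather than as a statement about Java.

```python
import sys


def descriptor_vs_copy_sizes(source, start, stop):
    """Return (size of a view object, size of an independent copy) in bytes."""
    view = memoryview(source)[start:stop]   # bookkeeping only, no byte copying
    copy = bytes(view)                      # small object header plus the bytes
    return sys.getsizeof(view), sys.getsizeof(copy)


view_size, copy_size = descriptor_vs_copy_sizes(b"x" * 1_000_000, 0, 3)
```

For a 3-byte slice the view object itself is expected to be larger than the
copy, because the view has to record the base object, offset, length, and
layout, while the copy carries only a header and 3 bytes of payload.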

So there's no pure win here.  Python's current scheme is at least
predictable, and by everyone, with finite effort.  Agree you have a
particularly good but limited use for it, though, and Greg's suggestion of
using buffer objects under the covers is almost certainly "the right" idea.
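
One way the "buffer objects under the covers" idea might look today, as a
sketch only: memoryview is the modern successor to the old buffer type, and
build_from_slices is a hypothetical helper, not anything proposed in this
thread.  The slices are taken as views, so no per-slice copy is made; the
only copying happens once, in join.

```python
def build_from_slices(source, spans):
    """Concatenate source[start:stop] for each (start, stop) pair in spans."""
    view = memoryview(source)
    # Each view[start:stop] shares storage with source; join copies once.
    return b"".join(view[start:stop] for start, stop in spans)


result = build_from_slices(b"hello, world", [(0, 5), (7, 12)])
# result == b"helloworld"
```

The result is an ordinary bytes object that owns its data, so the views (and
whatever they were keeping alive) can be dropped as soon as join returns.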