String.join revisited (URGENT for 1.6)
Gareth McCaughan
Gareth.McCaughan at pobox.com
Mon May 29 14:40:59 EDT 2000
Fredrik Lundh wrote:
> Eric Jacobs wrote:
>> Doesn't seem like it would have to. Just imagine a method in the =
> sequence
>> "class" that said:
>> =20
>> def join(self, separator):
>> result =3D self[0]
>> for x in self[1:]:
>> result =3D result + separator + x
>> return result
>> =20
>> No new operators are necessary, unless you want join to do something
>> completely different...
>
> like, for example, joining strings efficiently?
>
> (If you don't see my point, try benchmarking your code vs. string.join
> on 10, 1000, 100000 substrings. Do you think the timing difference can
> be eliminated by writing your code in C?)
Surely there are two separate issues here.
1. Can we define "join" sensibly in terms of other operators?
2. Can we implement that definition efficiently?
Eric was, I think, addressing question 1, and you're rejecting
his suggestion because it doesn't address question 2.
Consider the following definitions, for instance.
def join(self, separator):
return self._join(separator, 0, len(self))
def _join(self, separator, start, end):
if start+1>=end: return self[start]
if start+2>=end: return self[start]+separator+self[start+1]
mid = (start+end)/2
return self._join(separator, start, mid) \
+ separator \
+ self._join(separator, mid, end)
A similar definition (using a function rather than a method,
since we can't actually define new methods on lists) is slower
than string.join, but by a factor that *decreases* with
increasing n. The following table shows timings for joining
n copies of "foo" with "," as separator; the columns are
n, string.join time, my join time, and the ratio.
1000 0.000771999359131 0.0113580226898 14.7124768376
10000 0.00787794589996 0.129727005959 16.4671105395
100000 0.0847760438919 1.31694090366 15.5343519608
1000000 1.14992296696 12.9766529799 11.2848019848
And my definition, just like Eric's, works on lists with
any type of element that can be combined with +.
So, even though Eric's definition is a bad one in terms
of efficiency, it validly makes the point that we can
define a "join" operation that doesn't depend on anything
more than the ability to apply "+" to our elements.
A few comments:
- If "+" isn't associative on the elements, then Eric's
definition and mine give different answers. That's fine;
[...].join(sep) should simply be defined so that the
exact arrangement of parentheses is unspecified.
- Neither of our definitions gives sensible results
when applied to an empty list. I suggest the following
modification (or, rather, something equivalent to it):
def join(self, separator):
if not self: return 0*separator
return self._join(separator, 0, len(self))
This still isn't terribly nice.
- We probably want to be able to default "separator",
and obviously the default separator should depend
on what's in the list. So I suggest the following
further modification or its equivalent:
def join(self, separator=None):
if not self:
if separator is None: return None # ugh
return 0*separator
if separator is None:
separator = 0*self[0]
return self._join(separator, 0, len(self))
- Nothing we can possibly do will give a very sensible
result if we say [].join() -- too bad. I don't think
this is sufficient reason to abandon the idea of a
"join" method on general lists. (It probably *is*
reason still to have string.join .)
- It's conceivable that going via a general __add__
method might be slow. If we were doing this in C as
part of Python itself, there's no reason not to
special-case instances we want to work fast, like
strings and Unicode strings.
--
Gareth McCaughan Gareth.McCaughan at pobox.com
sig under construction
More information about the Python-list
mailing list