String.join revisited (URGENT for 1.6)

Mon May 29 14:40:59 EDT 2000

Fredrik Lundh wrote:

> Eric Jacobs wrote:
>> Doesn't seem like it would have to. Just imagine a method in the =
> sequence
>> "class" that said:
>> =20
>>      def join(self, separator):
>>          result =3D self[0]
>>          for x in self[1:]:
>>              result =3D result + separator + x
>>          return result
>> =20
>> No new operators are necessary, unless you want join to do something
>> completely different...
> 
> like, for example, joining strings efficiently?
> 
> (If you don't see my point, try benchmarking your code vs. string.join
> on 10, 1000, 100000 substrings. Do you think the timing difference can
> be eliminated by writing your code in C?)

Surely there are two separate issues here.

1. Can we define "join" sensibly in terms of other operators?
2. Can we implement that definition efficiently?

Eric was, I think, addressing question 1, and you're rejecting
his suggestion because it doesn't address question 2.

Consider the following definitions, for instance.

def join(self, separator):
  return self._join(separator, 0, len(self))

def _join(self, separator, start, end):
  if start+1>=end: return self[start]
  if start+2>=end: return self[start]+separator+self[start+1]
  mid = (start+end)/2
  return self._join(separator, start, mid) \
         + separator \
         + self._join(separator, mid, end)

A similar definition (using a function rather than a method,
since we can't actually define new methods on lists) is slower
than string.join, but by a factor that *decreases* with
increasing n. The following table shows timings for joining
n copies of "foo" with "," as separator; the columns are
n, string.join time, my join time, and the ratio.

   1000 0.000771999359131 0.0113580226898 14.7124768376
  10000 0.00787794589996  0.129727005959  16.4671105395
 100000 0.0847760438919   1.31694090366   15.5343519608
1000000 1.14992296696    12.9766529799    11.2848019848

And my definition, just like Eric's, works on lists with
any type of element that can be combined with +.

So, even though Eric's definition is a bad one in terms
of efficiency, it validly makes the point that we can
define a "join" operation that doesn't depend on anything
more than the ability to apply "+" to our elements.

A few comments:

  - If "+" isn't associative on the elements, then Eric's
    definition and mine give different answers. That's fine;
    [...].join(sep) should simply be defined so that the
    exact arrangement of parentheses is unspecified.

  - Neither of our definitions gives sensible results
    when applied to an empty list. I suggest the following
    modification (or, rather, something equivalent to it):

        def join(self, separator):
          if not self: return 0*separator
          return self._join(separator, 0, len(self))

    This still isn't terribly nice.

  - We probably want to be able to default "separator",
    and obviously the default separator should depend
    on what's in the list. So I suggest the following
    further modification or its equivalent:

        def join(self, separator=None):
          if not self:
            if separator is None: return None # ugh
            return 0*separator
          if separator is None:
            separator = 0*self[0]
          return self._join(separator, 0, len(self))

  - Nothing we can possibly do will give a very sensible
    result if we say [].join() -- too bad. I don't think
    this is sufficient reason to abandon the idea of a
    "join" method on general lists. (It probably *is*
    reason still to have string.join .)

  - It's conceivable that going via a general __add__
    method might be slow. If we were doing this in C as
    part of Python itself, there's no reason not to
    special-case instances we want to work fast, like
    strings and Unicode strings.

-- 
Gareth McCaughan  Gareth.McCaughan at pobox.com
sig under construction