StringIO proposal: add __iadd__

Paul Rubin
Sun Jan 29 21:20:22 EST 2006


aleax at mail.comcast.net (Alex Martelli) writes:
> > That depends on how much ram you have.  You could try a billion items.
> Let's see you try it

If you want me to try timing it with a billion items on your computer,
you'd have to set up a suitable account and open a network connection,
etc., probably not worth the trouble.  Based on examining StringIO.py,
on my current computer (512 MB of RAM), with 100 million items, it looks
like using a bunch of writes interspersed with seeks will be much
faster than just using writes.  I wouldn't have guessed THAT.  With a
billion items it will thrash no matter what.  
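
For concreteness, here's a rough sketch of how someone could time the
two patterns (function names and sizes are just illustrative, scaled
far below 100 million items; at small sizes the interspersed seeks
mostly just add overhead, since the memory effect above only shows up
once the buffer list itself gets huge):

    import time
    from StringIO import StringIO

    def just_writes(n):
        s = StringIO()
        for i in xrange(n):
            s.write('x' * 10)
        return s.getvalue()

    def writes_and_seeks(n):
        s = StringIO()
        for i in xrange(n):
            s.write('x' * 10)
            s.seek(0, 2)   # seek to end; StringIO.py joins buflist into buf here
        return s.getvalue()

    for f in (just_writes, writes_and_seeks):
        t0 = time.time()
        f(10000)
        print f.__name__, time.time() - t0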

> > I'd have used the array module.
> ...and would that support plain byte strings and Unicode smoothly and

Actually, I see that getvalue is supposed to raise an error if you
mix unicode with non-ASCII 8-bit strings:

    The StringIO object can accept either Unicode or 8-bit strings,
    but mixing the two may take some care. If both are used, 8-bit
    strings that cannot be interpreted as 7-bit ASCII (that use the
    8th bit) will cause a UnicodeError to be raised when getvalue()
    is called.

This is another surprise; I'd have thought it could just convert to
unicode as soon as it saw a unicode string, but I think I understand
the idea.  The result is that with StringIO (Python 2.4.1),

   from StringIO import StringIO

   s = StringIO()    # ok
   s.write('\xc3')   # ok (non-ASCII byte string)
   s.write(u'a')     # ok (unicode)
   s.seek(0,2)       # raises UnicodeDecodeError

Raising the error at the second s.write doesn't seem like a big
problem.  The StringIO doc currently doesn't mention that seek can
raise a Unicode exception, so it needs to be fixed either way.

> > the whole ''.join idiom revolves around the programmer knowing
> > some undocumented behavior of the implementation...
> 
> No more than StringIO.write "revolves around" the programmer knowing
> exactly the same thing about the optimizations in StringIO: semantics
> are guaranteed, performance characteristics are not.

I think having either one use quadratic time is bogus (something like
n log n might be ok).  

> > reliance on undocumented behavior seems totally bogus to me, but if
> 
> So I assume you won't be using StringIO.write any more, nor ANY other
> way to join sequences of strings?  Because the performance of ALL of
> them depend on such "undocumented behavior".

I'll keep using them but it means that the program's complexity
(i.e. that it's O(n) and not O(n**2)) depends on the interpreter
implementation, which is bogus.  Do you really want a language
designed by computer science geniuses to be so underspecified that
there's no way to tell whether a straightforward program's running
time is linear or quadratic?
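
For reference, a minimal sketch of the two idioms side by side; whether
the loop version ends up O(n) or O(n**2) overall is exactly the
implementation-dependent part (whether CPython optimizes the += form at
all is itself an implementation detail):

    import time

    def concat_naive(pieces):
        # repeated +=: each step may copy everything accumulated so far,
        # so the total work can be quadratic in the number of pieces
        result = ''
        for p in pieces:
            result += p
        return result

    def concat_join(pieces):
        # the ''.join idiom: builds the result in one pass, linear total work
        return ''.join(pieces)

    pieces = ['x' * 10] * 10000
    for f in (concat_naive, concat_join):
        t0 = time.time()
        f(pieces)
        print f.__name__, time.time() - t0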

> Besides C++'s standard library, very few languages like to pin
> themselves down by ensuring any performance guarantee;-).

I seem to remember Scheme guarantees tail recursion optimization.
This is along the same lines.  It's one thing for the docs to not want
to promise that .sort() uses at most 3.827*n*(lg(n)+14.7) comparisons
or something like that.  That's what I'd consider to be pinning down a
performance guarantee.  Promising that .sort() is O(n log n) just says
that the implementation is reasonable.  The C library's "qsort" doc
even specifies the precise algorithm, or used to.  Python's heapq
module doc similarly specifies heapq's algorithm.

Even that gets far afield.  The real objection here (about ''.join) is
that every Python user is expected to learn a weird, pervasive idiom,
but the reason for the idiom cannot be deduced from the language
reference.  That is just bizarre.

> As for performance guarantees, I don't think we have them now even for
> list.append, list.extend, dict.__getitem__, and other similarly
> fundamental methods.  I assume any such guarantees would have to
> weaselword regarding costs of memory allocation...

I think it's enough to state the amortized complexity of these
operations to within a factor of O(log(N)).  That should be easy
enough to do with the standard implementations (dicts using hashing,
etc) while still leaving the implementation pretty flexible.  That
should allow determining the running speed of a user's program to
within a factor of O(log(N)), a huge improvement over not being able
to prove anything about it.  Even without such guarantees, it would be
enough to say that these operations work in the obvious ways ("dicts
use hashing..."); and even without saying that, relying on the behavior
isn't so terrible, because the code you write is the obvious code for
operations that work in the obvious ways.  That's not the case for
''.join, whose use is not obvious at all.
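
As a sanity check, here's roughly how one could confirm that
list.append behaves as amortized O(1) on a given interpreter (the
sizes are arbitrary; total time should grow about linearly with n):

    import time

    def time_appends(n):
        # total cost of n appends; roughly linear growth in n
        # suggests amortized O(1) per append
        lst = []
        t0 = time.time()
        for i in xrange(n):
            lst.append(i)
        return time.time() - t0

    for n in (10**5, 10**6, 10**7):
        print n, time_appends(n)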

I do see docs for the built-in hash function and __hash__ method

  http://docs.python.org/lib/built-in-funcs.html#l2h-34
  http://docs.python.org/ref/customization.html#l2h-195

indicating that dictionary lookup uses hashing.
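
Put another way, any user-defined key type already has to cooperate
with that hashing; a small illustration (class and attribute names are
made up):

    class Point(object):
        # __hash__ and __eq__ must agree: equal keys must hash equal,
        # or dict lookup by hashing breaks
        def __init__(self, x, y):
            self.x, self.y = x, y
        def __eq__(self, other):
            return isinstance(other, Point) and \
                   (self.x, self.y) == (other.x, other.y)
        def __hash__(self):
            return hash((self.x, self.y))

    d = {Point(1, 2): 'north'}
    print d[Point(1, 2)]   # lookup hashes the key, then compares for equality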


