in place input

Jeff Epler jepler at unpythonic.net
Sun Oct 12 22:47:42 EDT 2003


[Oops.  I apologize that this message has turned into a long list of
'timeit' results and that I've been too timid to really draw a
conclusion.  Good luck!]

Without knowing exactly how you need to operate on your text, you might
find the 'mmap' module and/or the 'buffer' object useful, in addition to
the 'array' object which was already mentioned.

In general, Python may do lots of allocations and deallocations for
every statement--for instance, the expression '123+45' allocates a new
integer object.  For the sake of performance, a special freelist of
partially initialized int-sized objects is maintained.  Usually a new int
object can be created without reaching a system malloc() call, and free()
is not necessarily called when an int object's reference count drops to
zero--the object goes back on the freelist instead.
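A related optimization is visible from pure Python: CPython keeps the
small ints -5 through 256 as shared singletons, while larger results come
back from the allocator as fresh objects.  (A quick CPython-specific
demonstration; the identities below are implementation details, not
language guarantees.)

```python
# CPython-specific: small ints (-5..256) are shared singleton objects,
# so "creating" one just returns the cached instance.
s, t = 100, 100
a = s + t          # computed at runtime -> 200, which is cached
b = s + t
print(a is b)      # True in CPython

# Larger ints are built fresh each time, served from the allocator.
m, n = 1000, 1000
x = m + n          # 2000, a brand-new object every time
y = m + n
print(x is y)      # False in CPython
```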

Python's so-called "pymalloc" or "obmalloc" maintains a set of pools for
objects up to 256 bytes, which function similarly to the special int
allocator.  The largest size class corresponds to a string of
something like 236 characters.
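The effect of the size-class cutoff can also be probed from within Python
using the timeit module.  A rough sketch (the cutoff was 256 bytes in the
Python 2.3 era described here and is 512 bytes in current CPython;
absolute numbers depend entirely on the machine):

```python
import timeit

# Time building a string that fits in a pymalloc pool versus one large
# enough to fall through to the general allocator.
below = timeit.timeit('s * 2', setup='s = "x" * 100', number=100_000)
above = timeit.timeit('s * 2', setup='s = "x" * 4000', number=100_000)

print(f"below cutoff: {below:.4f}s  above cutoff: {above:.4f}s")
```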

What does all this mean?

$ timeit -s 's = "x" * 110' 't = s * 2'
100000 loops, best of 3: 2.27 usec per loop
$ timeit -s 's = 110' 't = s * 2'
1000000 loops, best of 3: 1.16 usec per loop

So, creating a 220-character string by using the '*' operator takes just
about twice as long as creating an int using '*'.  And to my surprise
it doesn't seem to hurt that much to go above the obmalloc cutoff size:

$ timeit -s 's = "x" * 250' 't = s * 2'
100000 loops, best of 3: 2.8 usec per loop

Timings are unchanged when using a unicode string that occupies the same
number of bytes:

$ timeit -s 's = u"x" * 125' 't = s * 2'
100000 loops, best of 3: 2.8 usec per loop

Doing .read() on a file object takes significantly longer than creating
strings using *:
$ timeit -s 'f = open("/dev/zero")' 's = f.read(200)'
100000 loops, best of 3: 6.11 usec per loop
... so maybe the time spent reading is not actually dominated by
allocating the strings after all?

Prebinding the read method or using os.read both improve the timing:
$ timeit -s 'f = open("/dev/zero", "rb"); read = f.read' 's = read(200)'
100000 loops, best of 3: 4.11 usec per loop
$ timeit -s 'f = open("/dev/zero", "rb"); fd = f.fileno(); from os import read' 's = read(fd, 200)'
100000 loops, best of 3: 3.9 usec per loop
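On a modern Python the two variants look like this (sketched against a
temporary file rather than /dev/zero so it runs anywhere; shown for
shape, not timed):

```python
import os
import tempfile

# Create a scratch file with some known content.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 400)

# os.read works on the raw file descriptor, skipping the file-object
# layer entirely.
os.lseek(fd, 0, os.SEEK_SET)
chunk = os.read(fd, 200)          # reads at most 200 bytes
os.close(fd)
print(len(chunk))                 # 200

# Prebinding the method avoids one attribute lookup per call.
with open(path, "rb") as f:
    read = f.read                 # bind once, call many times
    first = read(200)
os.remove(path)
print(len(first))                 # 200
```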

Using mmap and/or buffer gives very good timings, but I'm not sure if
the code does what I think it's doing:
$ timeit -s 'f = open("/bin/sh", "rb"); import mmap; m = mmap.mmap(f.fileno(), 4096, prot=mmap.PROT_READ)' 'm[:200]'
1000000 loops, best of 3: 1.95 usec per loop
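What the slicing is doing: each m[:200] copies 200 bytes out of the
mapping into a new string object (a bytes object on Python 3).  A
self-contained sketch using a scratch file, with the portable
access=ACCESS_READ in place of the POSIX-only prot=PROT_READ:

```python
import mmap
import os
import tempfile

# Make a small scratch file to map.
fd, path = tempfile.mkstemp()
os.write(fd, b"abcdefgh" * 64)    # 512 bytes
os.close(fd)

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ makes it read-only.
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = m[:200]               # copies 200 bytes out as a new object
    m.close()
os.remove(path)

print(len(chunk), chunk[:8])
```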

And so does using mmap's 'readline' method, especially considering
that I could only figure out how to benchmark it with a seek() at each
iteration:
$ timeit -s 'f = open("/etc/fstab", "rb"); import mmap, os; m = mmap.mmap(f.fileno(), os.fstat(f.fileno()).st_size, prot=mmap.PROT_READ); seek=m.seek; readline=m.readline' 'seek(0); readline()'
100000 loops, best of 3: 3.52 usec per loop
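mmap objects keep their own file position, which is what the
seek()/readline() pair in that benchmark is exercising.  A minimal sketch
on a scratch file:

```python
import mmap
import os
import tempfile

# Scratch file with two known lines.
fd, path = tempfile.mkstemp()
os.write(fd, b"first line\nsecond line\n")
os.close(fd)

with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    m.seek(0)                     # rewind, as the benchmark loop does
    line = m.readline()           # reads up to and including the newline
    m.close()
os.remove(path)

print(line)                       # b'first line\n'
```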

(3.52 usec to read a ~80-byte string means the program would
process 22MB/second, according to the trusty "units" program, on this
lowly 650MHz laptop)

Here's the behavior when reading my OS's dictionary of English words.
Obviously the lines are short, but it looks like the seek() call
practically doubles the runtime:
$ timeit -n 45000 -s 'f = open("/usr/share/dict/words", "rb"); import mmap, os; m = mmap.mmap(f.fileno(), os.fstat(f.fileno()).st_size, prot=mmap.PROT_READ); seek=m.seek; readline=m.readline' 'readline()'
45000 loops, best of 3: 1.87 usec per loop
And the largest file (by # bytes) in the Python 2.3.1 distribution:
$ timeit -n 9700 -s 'f = open("Mac/Modules/qt/_Qtmodule.c", "rb"); import mmap, os; m = mmap.mmap(f.fileno(), os.fstat(f.fileno()).st_size, prot=mmap.PROT_READ); seek=m.seek; readline=m.readline' 'readline()'
9700 loops, best of 3: 1.96 usec per loop

If these figures are not good enough for you, and you remain convinced
that the culprit is the time taken to allocate string objects, you might
modify your version of Python to keep a set of special string freelists
for various size ranges (for instance, strings <16 bytes, strings 16-255
bytes, and strings 256-4076 bytes might be suitable if you decide on
3 lists and want to cover objects up to the CPU's page size).  If the
complexity is small and you can demonstrate performance improvements
for a range of Python programs (not just your program), it might even
be a good candidate for inclusion in a future version of core Python.
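Such a change would live in CPython's C string allocator, but the
bookkeeping can be sketched in Python.  Everything below (the class name,
the method names) is illustrative only; the size boundaries are the ones
suggested above:

```python
# Toy sketch of size-class freelists; the real version would sit inside
# CPython's C allocator, not at the Python level.
class SizeClassPool:
    SIZE_CLASSES = (16, 256, 4076)   # boundaries suggested in the text

    def __init__(self):
        # One freelist per size class.
        self._free = {limit: [] for limit in self.SIZE_CLASSES}

    def _class_for(self, size):
        for limit in self.SIZE_CLASSES:
            if size <= limit:
                return limit
        return None                  # too big: use the general allocator

    def alloc(self, size):
        limit = self._class_for(size)
        if limit is not None and self._free[limit]:
            return self._free[limit].pop()   # reuse, no fresh allocation
        # Allocate the full class size so released blocks are reusable.
        return bytearray(limit or size)

    def release(self, buf):
        limit = self._class_for(len(buf))
        if limit is not None:
            self._free[limit].append(buf)    # keep the block for reuse
```

A freed block comes straight back on the next allocation in the same size
class, which is exactly the behavior the int freelist and pymalloc pools
provide for real objects.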

Jeff




