arraymodule questions

Sat Apr 1 05:25:22 EST 2000

Hello,

  (the questions are at the end; but first, discussion)

  I'm working on a project to store and analyze protein and
nucleotide sequences with Python.  These are often stored
as character strings.  I'm trying to find out the best storage
container for the data.  I've been experimenting with
array.array of characters instead of string because of its
in-place mutability.

For example, suppose I use a string as my primary data type,
then a function might be written as (given in another recent
post of mine)

def translate(dna, table = Translation.standard_table):
  protein = []
  for i in range(0, len(dna), 3):
    protein.append(table[dna[i:i+3]])
  return string.join(x, "")

This stores the temporary data in an [], and I don't like
that.  I would rather not have the additional conversion
back to a string.  With a character array:

def translate(dna, table = Translation.standard_table):
  dna = dna.tostring()
  protein = array.array( ("c", "") )
  for i in range(0, len(dna), 3):
    protein.append(table[dna[i:i+3]])
  return protein

This also requires a conversion, but conversion to a string
uses about 1/4th the memory as using a list, and the string
conversion is trivial since the array stores its data as a
block of characters.  (It's only 1/4 since Python special
cases single letter strings to reuse the same object.)

However, the performance of array.array is about 30% slower
than appending to a list followed by conversion to string.

Tracking it down, about 20% of the time is spent because
ins1 in arraymodule.c calls c_setitem twice.  The first
checks if the input object is a character, but with an index
meaning "don't acually set the item."  The second time
occurs after the array was resized, to actually set the
item.

What I did was add special code for append rather than
calling the generic insert function.  This always increases
the size by 1 item - optimistically assuming the setitem
will work.  Then it calls setitem and, if that fails, decrement
the size count by 1 without resizing the array back down.

(My assumption is working code most likely won't fail, so the
amortized cost of the infrequent extra realloc and few bytes
will be low.)

I got another 20% performance by preallocating an additional
64 items when the array size is increased, which reduces
the number of realloc calls.

With these changes, the two versions (string and array) of
my timing test come out rougly the same.  (I know, 30%-20%-20%
doesn't come out to be 0% - the percents aren't all based on
the same initial timings.)

I started thinking seriously about using an array for the
base character storage, but then noticed that arraymodule
doesn't expose the array object C level API.  There structure
is only defined in arraymodule.c, and not in any header file.
There will be C extensions which need access to the character
data, so I can't use an array.

So my questions are about changes for Python 1.6:
  1) would it be okay to make the arrayobject C API available?
  2) could I add the .append optimizations mentioned above?
  3) should I also add the code to preallocated space?

Some of the Python methods are commented out, such as "count".
Why is this so?  There are comments implying that if someone
needs them, they can be added.  I could do this.

I notice "extend" doesn't exist.  I could also add that.

                    Andrew Dalke
                    dalke at acm.org