To Tuple or to Map?

Alex Martelli aleax at aleax.it
Fri Jan 25 08:41:07 EST 2002


"Clark C . Evans" <cce at clarkevans.com> wrote in message
news:mailman.1011897176.24363.python-list at python.org...
> I've got a question.  In my application I deal with many
> tables (some of them very big) where each row in the table
> has many cells.  I have a question as to how I should model
> cells.  Should they be maps or tuples?

Presumably, you mean to ask how to model rows, rather
than cells, and by maps you mean dictionaries.

> If they are maps,
> then I'd do row['colname'] to access the value and if they
> are tuples, I'd do row[nColName] where nColName is an integer.
> Since all of my rows are relatively standard, I'm leaning
> toward using tuples and integers.  But, I must admit, using
> a bunch of maps looks a bit cleaner...  So, here is the question:
>
> (a) Does a map have significantly more run-time
>     overhead than a tuple?

Yes.  Run your own benchmarks -- it's easy.  For example:

import time

bigtuple = tuple(range(10000))
bigdict = dict([(i,i) for i in bigtuple])   # same keys and values, as a dict

def timeit(toaccess):
    # time a million indexed accesses into toaccess
    start = time.clock()
    for j in range(100):
        for i in range(10000):
            toaccess[i]
    stend = time.clock()
    return stend-start

print 'tuple: %.2f'%timeit(bigtuple)
print 'dict: %.2f'%timeit(bigdict)

C:\Python22>python -O ap.py
tuple: 1.51
dict: 1.97

C:\Python22>python -O ap.py
tuple: 1.50
dict: 2.01

You see, for a million accesses a dict can take about 500
milliseconds more than a tuple; that's almost half a
microsecond on each access (on my old, slow machine).
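
Spelling out the arithmetic from the second run above:

accesses = 100 * 10000            # the accesses timeit() performs
extra = 2.01 - 1.50               # dict seconds minus tuple seconds
print '%.2f microseconds per access' % (extra / accesses * 1e6)
# -> 0.51 microseconds per access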

It's "significant" in proportional terms -- probably
not in YOUR application (since it's not going to
matter in well over 90% of the cases), but only you
can answer that.


> (b) Is there any way to mark a variable (nColName)
>     as a constant so that the "byte compiler" can
>     resolve the value to an integer (instead of
>     requiring an access to the local namespace)

No (except maybe bytecodehacks, but you don't want
to get into those).
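
If you want to see why, here's a quick sketch with the standard
dis module (NCOL and getcell are just illustrative names):

import dis

NCOL = 3
def getcell(arow):
    return arow[NCOL]

dis.dis(getcell)
# the disassembly shows a LOAD_GLOBAL for NCOL: the name is looked
# up at run time on every call, never folded into a constant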


> (c) Which technique would you use?

Yet another: use a row class.
E.g.:

class row(tuple):
    names = 'whatever column names you need'.split()
    indexof = {}
    for i in range(len(names)):
        indexof[names[i]] = i
    def __new__(cls, *values):
        # tuples are immutable, so the values must go in at __new__
        # time rather than in __init__
        assert len(cls.names)==len(values)
        return tuple.__new__(cls, values)
    def __getattr__(self, name):
        try:
            return self[self.indexof[name]]
        except KeyError:
            raise AttributeError, name

Yes, this has even more overhead when you access it
by name, e.g.:
myrow = row(5,3,2,6,8)
print myrow.column      # the value in the 'column' slot, i.e. 3

But it's nice and clear, and flexible too (you can also
use myrow[3] should you ever need to), and quite Pythonic.
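
For instance, with some made-up column names, just to sketch
the idea:

class person_row(row):
    names = 'name age city'.split()
    indexof = {}
    for i in range(len(names)):
        indexof[names[i]] = i

arow = person_row('Alice', 42, 'Bologna')
print arow.age          # access by name -> 42
print arow[2]           # access by index -> 'Bologna'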

In the unlikely case that this should turn out to be the
performance bottleneck making your program too slow, it
will be easy to optimize things later (after running
some profiling to confirm the culprits).  Most of your
code can stay readable and maximally nice...
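
E.g., a minimal sketch with the standard profile module, assuming
your program's entry point is some function main:

import profile
profile.run('main()')   # per-function timings show where time really goes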


Alex