Making a dict from two lists/tuples

Alex Martelli aleaxit at yahoo.com
Fri May 25 09:05:45 EDT 2001


"Carlos Ribeiro" <cribeiro at mail.inet.com.br> wrote in message
news:mailman.990732696.22272.python-list at python.org...
> At 06:51 24/05/01 -0400, Andreas Jung wrote:
> >dict={}
> >
> >for i in range(len(keys)):
> >     dict[keys[i]] = values[i]
>
> I think that this kind of construct has been beaten to the death many
> times, but it keeps being used, mainly because it looks similar to the
> 'for' statement of C or Pascal.

...or maybe because it CAN be faster than most sensible alternatives
in some cases...?-)


> This kind of loop should be avoided in Python for several reasons:
>
> - Explicit for loops in Python are generally much slower than implicit
> loops, such the ones provided by map(), zip() or list comprehensions. *In

An excellent point, except that it happens to be false.  Please
re-read this thread: I did several performance measurements and
posted numbers (apparently a few others also measured, but I
saw no other numbers posted).  List comprehensions don't let you
build a dictionary (only a list), zip loses BIG-TIME when
len(keys) is biggish, and the only approach that happens to be
a tad faster than this obvious and elementary one on my box is
a tricky-indeed map(dict.setdefault, keys, values) [throwing away
the result list...].


> - Range(len(keys)) builds a list. This takes some time, and uses up a
> potentially big amount of memory. This is not an issue in some cases;
> however, if the body of the 'for' statement is comprised of simple
> statements, the performance hit may be noticeable.

...in which case you may want to try xrange() instead, of course.

Here's a follow-up version of my timing script, with better
organization and more variation:
::: file po.py
import time

def timit(funz):
    # time one dict-building function against the global keys/vals
    d1 = {}
    start = time.clock()
    funz(d1, keys, vals)
    stend = time.clock()
    print "  %s: %s" % (funz.__name__, stend - start)

def mappy(d1, keys, vals):
    # pair up keys and vals via map(None, ...), then loop over the pairs
    for k, v in map(None, keys, vals):
        d1[k] = v

def zippy(d1, keys, vals):
    # same idea, pairing up via zip()
    for k, v in zip(keys, vals):
        d1[k] = v

def fancy(d1, keys, vals):
    # the setdefault trick: the resulting list is thrown away
    _ = map(d1.setdefault, keys, vals)

def emile(d1, keys, vals):
    # the plain, explicit indexed loop
    for i in range(len(keys)):
        d1[keys[i]] = vals[i]

def xmile(d1, keys, vals):
    # same loop, but with xrange instead of range
    for i in xrange(len(keys)):
        d1[keys[i]] = vals[i]

funcs = mappy, zippy, fancy, emile, xmile

print "int(100000) keys and vals:"
keys = range(100000)
vals = range(100000)
for f in funcs:
    timit(f)

print "str/hex (50000) keys and vals:"
keys = [str(i) for i in range(50000)]
vals = [hex(i) for i in range(50000)]
for f in funcs:
    timit(f)

print "hex/str (50000) keys and vals:"
keys, vals = vals, keys
for f in funcs:
    timit(f)
::: end of file


D:\Python21>python po.py
int(100000) keys and vals:
  mappy: 1.42234294517
  zippy: 1.65447433837
  fancy: 0.453661721347
  emile: 0.474877260971
  xmile: 0.443137761046
str/hex (50000) keys and vals:
  mappy: 0.694674408431
  zippy: 0.913754146476
  fancy: 0.307802162621
  emile: 0.305412753461
  xmile: 0.281936871324
hex/str (50000) keys and vals:
  mappy: 0.686130866875
  zippy: 0.903102795718
  fancy: 0.303567268028
  emile: 0.30191454447
  xmile: 0.281214433339

As we can see, to a first-order approximation the key issue
is AVOIDING the zip() and map() solutions, which are SUBSTANTIALLY
slower than any of the other ideas.  The others run roughly head
to head, with a tiny (but apparently repeatable) advantage to
the loop you decry, as long as xrange() is used.


> - The worse aspect is that you built an intermediate index list (the
> range() one) that had nothing to do with the original problem. It's
> just an artifact; it's not part of the problem space.

map() builds an intermediate list which is an artifact, too, and so
does zip!  map() in "fancy" happens to produce a copy of the list of
values, while zip (and map in "mappy") are deliberately used to produce
a list of (key, value) pairs.  And so?!

As for range (which builds an intermediate "artifact" list) vs xrange
(which, instead, builds an intermediate "artifact" iterator object,
so to speak), I don't see why the difference should be considered
so huge as to make range "the worse aspect".  Whatever intermediate
structures may help in solving the problem, surely their costs, if
any, are to be measured vs their advantages.  A list of indices is
conceptually very simple, though a BIT costlier than the mysterious
xrange object in this case.  xrange makes you forgive its slight
mystery by delivering a wee little bit of extra speed.  Surely it's
a minor point, hardly "the worse aspect" of anything.

I strongly object to characterizing implementation aspects as
"not part of the problem space" AS IF that were a bad thing.  Just
as the map is not the territory, so the solution is not the
problem.  Pretending otherwise is a horrible conceptual error.


> - There is also a subtler reason to avoid it: it is expressed in an
> imperative way; you are telling The Computer to do exactly what you
> want it to do, step by step. This low-level approach is excellent if
> you are doing low-level stuff. However, this is the kind of approach
> that makes you lose the focus on the higher levels of the problem.
> This leads to problems when trying to propose more abstract solutions;
> you are so deep in implementation details that the higher-level
> abstractions become harder to understand.

Any of these solutions will be wrapped up into a function, and
thought of "unitarily", quite apart from all the rest of the
program.  HOW the function is implemented internally is, at most,
a third-order concern.
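
For instance (a minimal sketch; the name dict_from is mine, not
anything standard), the rest of the program only ever sees:

def dict_from(keys, values):
    # the plain, obvious loop, hidden behind a function boundary
    d = {}
    for i in xrange(len(keys)):
        d[keys[i]] = values[i]
    return d

and whether the body uses xrange, map(d.setdefault, ...), or a C
extension is a detail callers never need to care about.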

Just for fun, here's a q&d proof-of-concept C implementation
of the same function:
::: file apo.c
#include "Python.h"

static PyObject *
make_dict(PyObject* self, PyObject* args)
{
    int i, rc;
    PyObject *keys, *values, *dict;
    if(!PyArg_ParseTuple(args, "OO", &keys, &values))
        return 0;
    if(!PySequence_Check(keys)) {
        PyErr_SetString(PyExc_TypeError, "First arg not sequence");
        return 0;
    }
    if(!PySequence_Check(values)) {
        PyErr_SetString(PyExc_TypeError, "Second arg not sequence");
        return 0;
    }
    dict = PyDict_New();
    for(i=0; ;i++) {
        PyObject *key, *value;
        key = PySequence_GetItem(keys, i);
        if(!key) { PyErr_Clear(); break; }    /* keys exhausted (IndexError): done */
        value = PySequence_GetItem(values, i);
        if(!value) { Py_DECREF(key); PyErr_Clear(); break; }    /* values exhausted */
        rc = PyDict_SetItem(dict, key, value);
        Py_DECREF(key); Py_DECREF(value);    /* SetItem takes its own references */
        if(rc<0) { Py_DECREF(dict); return 0; }
    }
    return dict;
}

static PyMethodDef apo_module_functions[] =
{
    { "dict", (PyCFunction)make_dict, METH_VARARGS },
    { 0, 0 }
};

void
initapo(void)
{
    PyObject* apo_module = Py_InitModule("apo", apo_module_functions);
}
::: end of file

with the obvious setup.py:

import sys
from distutils.core import setup, Extension

apo_ext = Extension('apo', sources=['apo.c'])

setup (name = "apo",
       ext_modules = [ apo_ext ]
)
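
To build it the usual distutils way (assuming a compiler that
distutils knows about is installed; the exact incantation depends
on your setup), something like:

D:\Python21>python setup.py install

puts the apo extension where the interpreter can import it.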

and here's the performance -- the same test as before, with the
obvious apopy() and aponu() functions to compare updating an
existing dictionary vs creating a brand-new one:

def apopy(d1, keys, vals):
    # build a new dict in C, then merge it into the existing one
    d1.update(apo.dict(keys, vals))

def aponu(d1, keys, vals):
    # just build (and discard) a brand-new dict in C
    _ = apo.dict(keys, vals)
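
(To run them through timit, po.py just needs an "import apo" at the
top, these two functions, and the funcs tuple extended accordingly,
e.g.:)

funcs = apopy, aponu, mappy, zippy, fancy, emile, xmile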


D:\Python21>python po.py
int(100000) keys and vals:
  apopy: 0.366732820307
  aponu: 0.265795159498
  mappy: 1.43757364761
  zippy: 1.66614313659
  fancy: 0.456552311383
  emile: 0.47589722272
  xmile: 0.443106751527
str/hex (50000) keys and vals:
  apopy: 0.300958277949
  aponu: 0.21740606211
  mappy: 0.701232502669
  zippy: 0.923194449799
  fancy: 0.298968640157
  emile: 0.309439800466
  xmile: 0.282484147431
hex/str (50000) keys and vals:
  apopy: 0.3036234204
  aponu: 0.215645224283
  mappy: 0.677776734815
  zippy: 0.899023786815
  fancy: 0.300057325706
  emile: 0.315924142335
  xmile: 0.281894966568

So we get a pretty modest benefit here by going to C from Python,
compared to what going to C buys in most cases.

Isn't it funny that the xmile you decry so
much turns out to be FASTER than a special
purpose C extension when updating an existing
dictionary with 50,000 string keys/values?
No doubt we'll be able to tune the C ext, if
we need that, by hardcoding the args as lists,
etc, etc, but I don't expect all that much
advantage here (left as an exercise for the
reader!-).

The key issue is avoiding the seriously wrong
approach (the 'obvious, functional' mappy and
zippy).  Compared to what you can gain by such
avoidance, there are but small further potatoes
to be gained, even by pushing HARD.


> Someone has written about this a few days ago. C/C++ programmers tend to
> avoid 'complex' data structures such as mappings when prototyping. The
> reason is that it is much harder to quickly write-compile-run a working
> prototype of these high level structures in a low level language. You end
> up having to write a lot of stuff just to make you prototype work. That's
> when many people switch to Python :-)

The "C/C++" generalization falls down as usual here.  C++ can
be pretty high-level, and makes writing a std::map a snap --
its problems are VERY different and can be summarized as HUGE
LANGUAGE COMPLEXITY (due in good part to stretching up to work
as a high-level language while keeping the low-level access
and speed, as it happens).  C never really tries to be high
level, so prototyping is emphatically NOT its forte, of course.


> Take a look at some of the solutions proposed by Duncan Booth in a
> previous message:
>
> 1) for key, value in zip(keys,values): dict[key] = value
> 2) for key, value in map(None, keys,values): dict[key] = value

It does not appear to me that you have brought forward any argument
that justifies taking a performance hit of hundreds of percent on
the altar of the "elegance" of these approaches, compared with a
loop that "anybody off the street" will understand.


> 3) map(dict.setdefault, keys, values)

I believe I was the first to propose this on this thread.

Originally I thought its speed justified its excessive fanciness.

Having now measured more carefully what a plain loop with xrange
can do, I vote for the plain loop.  Fancy approaches are only
justified when they bring real advantages of either speed,
or elegance and handiness.  List comprehensions can be very
elegant and handy indeed, so, for me, they easily "carry
their weight".  This particular idiom is not really elegant,
AND it turns out to be roughly comparable to, and (on my box) a
bit slower than, the plain, obvious solution.  Therefore
it qualifies as misplaced fanciness, in my opinion.


> The first two ones still use a explicit 'for' loop. However, we avoid
> creating a intermediate index list, that truly had nothing to do with the

...in favour of creating ANOTHER intermediate list (of tuples)
which has just about as little "to do with the original problem".

> original problem. Both the zip() or the map() calls are being used for the
> same reason: to 'pair up' all elements of both lists.

In other words, to create an intermediate tuples list instead
of an intermediate index list (or, using xrange, an intermediate
"index"-object).  Six of one, half a dozen of the other.


> The third option is the simplest one. It is written in a functional
> fashion, and it is pretty simple: map all keys to a value, using the

Actually: map all PAIRS of keys and values.

> setdefault call. My *only* remark here has to do with the 'setdefault'
> method name, but that's another issue <wink>

And NO remark about the "intermediate list" (as it happens, a copy
of the list of values, assuming no duplicates in the list of keys)
that we're creating AND silently throwing away?  Doesn't THAT one
"truly have nothing to do" etc?-)


Alex





