Making a dict from two lists/tuples
Alex Martelli
aleaxit at yahoo.com
Fri May 25 09:05:45 EDT 2001
"Carlos Ribeiro" <cribeiro at mail.inet.com.br> wrote in message
news:mailman.990732696.22272.python-list at python.org...
> At 06:51 24/05/01 -0400, Andreas Jung wrote:
> >dict={}
> >
> >for i in range(len(keys)):
> >    dict[keys[i]] = values[i]
>
> I think that this kind of construct has been beaten to the death many
> times, but it keeps being used, mainly because it looks similar to the
> 'for' statement of C or Pascal.
...or maybe because it CAN be faster than most sensible alternatives
in some cases...?-)
> This kind of loop should be avoided in Python for several reasons:
>
> - Explicit for loops in Python are generally much slower than implicit
> loops, such as the ones provided by map(), zip() or list comprehensions. *In
An excellent point, except that it happens to be false. Please
re-read this thread: I did several performance measurements and
posted numbers (apparently some few others also measured, but I
saw no other numbers posted). List comprehensions don't let you
build a dictionary (only a list), zip loses BIG-TIME when the
len(keys) is biggish, and the only approach that happens to be
a tad faster than this obvious and elementary one on my box is
a tricky-indeed map(dict.setdefault, keys, values) [throwing away
the result list...].
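For concreteness, here is what that setdefault trick does on toy data (a
sketch; on Python versions where map() is lazy, the map must be consumed,
e.g. via list(), for the side effects to happen at all):

```python
keys = ["a", "b", "c"]
vals = [1, 2, 3]

d = {}
# d.setdefault(k, v) stores v under k (when k is absent) and returns d[k];
# mapping it over the two lists fills the dict purely as a side effect.
pairs = list(map(d.setdefault, keys, vals))

assert d == {"a": 1, "b": 2, "c": 3}
assert pairs == [1, 2, 3]   # the result list we then throw away
```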
> - Range(len(keys)) builds a list. This takes some time, and uses up a
> potentially big amount of memory. This is not an issue in some cases;
> however, if the body of the 'for' statement is comprised of simple
> statements, the performance hit may be noticeable.
...in which case you may want to try xrange() instead, of course.
Here's a followup version of my timing-script, with better
organization and more variation:
::: file po.py
import time

def timit(funz):
    d1 = {}
    start = time.clock()
    funz(d1, keys, vals)
    stend = time.clock()
    print "    %s: %s" % (funz.__name__, stend-start)

def mappy(d1, keys, vals):
    for k, v in map(None, keys, vals):
        d1[k] = v

def zippy(d1, keys, vals):
    for k, v in zip(keys, vals):
        d1[k] = v

def fancy(d1, keys, vals):
    _ = map(d1.setdefault, keys, vals)

def emile(d1, keys, vals):
    for i in range(len(keys)):
        d1[keys[i]] = vals[i]

def xmile(d1, keys, vals):
    for i in xrange(len(keys)):
        d1[keys[i]] = vals[i]

funcs = mappy, zippy, fancy, emile, xmile

print "int(100000) keys and vals:"
keys = range(100000)
vals = range(100000)
for f in funcs:
    timit(f)

print "str/hex (50000) keys and vals:"
keys = [str(i) for i in range(50000)]
vals = [hex(i) for i in range(50000)]
for f in funcs:
    timit(f)

print "hex/str (50000) keys and vals:"
keys, vals = vals, keys
for f in funcs:
    timit(f)
::: end of file
D:\Python21>python po.py
int(100000) keys and vals:
mappy: 1.42234294517
zippy: 1.65447433837
fancy: 0.453661721347
emile: 0.474877260971
xmile: 0.443137761046
str/hex (50000) keys and vals:
mappy: 0.694674408431
zippy: 0.913754146476
fancy: 0.307802162621
emile: 0.305412753461
xmile: 0.281936871324
hex/str (50000) keys and vals:
mappy: 0.686130866875
zippy: 0.903102795718
fancy: 0.303567268028
emile: 0.30191454447
xmile: 0.281214433339
As we can see, to a first order approximation, the key issue
is AVOIDING the zip() and map() solutions, SUBSTANTIALLY slower
than any of the other ideas. The others, roughly, run head to
head, with a tiny (but apparently repeatable) advantage to
the loop you decry as long as xrange() is used.
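To be clear about what is and is not at stake: every candidate builds
the very same dictionary, so the choice is purely one of speed (and
taste). A tiny standalone check of that equivalence (a sketch with toy
data, not the 100000-item lists timed above; the list() call forces the
map on versions where map() is lazy):

```python
keys = ["x", "y", "z"]
vals = [10, 20, 30]

d_loop, d_zip, d_fancy = {}, {}, {}

for i in range(len(keys)):          # the decried indexed loop
    d_loop[keys[i]] = vals[i]

for k, v in zip(keys, vals):        # the zip() pairing
    d_zip[k] = v

list(map(d_fancy.setdefault, keys, vals))  # the setdefault trick

assert d_loop == d_zip == d_fancy == {"x": 10, "y": 20, "z": 30}
```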
> - The worse aspect is that you built an intermediate index list (the
> range() one) that had nothing to do with the original problem. It's just
> an artifact; it's not part of the problem space.
map() builds an intermediate list which is an artifact, too, as does
zip! map() in "fancy" happens to produce a copy of the list of values,
zip (and map in "mappy") are deliberately used to produce a list of
(key,value) pairs. And so?!
As for range (which builds an intermediate "artifact" list) vs xrange
(which, instead, builds an intermediate "artifact" iterator object,
so to speak), I don't see why the difference should be considered
so huge as to make range "the worse aspect". Whatever intermediate
structures may help in solving the problem, surely their costs, if
any, are to be measured vs their advantages. A list of indices is
conceptually very simple, though a BIT costlier than the mysterious
xrange object in this case. xrange makes you forgive its slight
mystery by delivering a wee little bit of extra speed. Surely it's
a minor point, hardly "the worse aspect" of anything.
I strongly object to characterizing implementation aspects as
"not part of the problem space" AS IF that were a bad thing. Just
as the map is not the territory, so the solution is not the
problem. Pretending otherwise is a horrible conceptual error.
> - There is also a subtler reason to avoid it: it is expressed in an
> imperative way; you are telling The Computer to do exactly what you
> want it to do, step by step. This low level approach is excellent if
> you are doing low level stuff. However, this is the kind of approach
> that makes you lose the focus on the higher levels of the problem.
> This leads to problems when trying to propose more abstract solutions;
> you are so deep in implementation details that the higher level
> abstractions become harder to understand.
Any of these solutions will be wrapped up into a function, and
thought of "unitarily" by all the rest of the program. HOW
the function is implemented internally is, at most, a third-order
concern.
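For instance (the name and signature here are mine, purely
illustrative), the rest of the program only ever sees something like:

```python
def make_dict(keys, vals):
    """Build a dict pairing keys[i] with vals[i].

    The indexed loop inside is an implementation detail the
    caller never sees -- it could be swapped for any of the
    other variants without touching any calling code.
    """
    d = {}
    for i in range(len(keys)):
        d[keys[i]] = vals[i]
    return d

assert make_dict(["a", "b"], [1, 2]) == {"a": 1, "b": 2}
```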
Just for fun, here's a q&d proof-of-concept C implementation
of the same function:
::: file apo.c
#include "Python.h"

static PyObject *
make_dict(PyObject* self, PyObject* args)
{
    int i, rc;
    PyObject *keys, *values, *dict;

    if(!PyArg_ParseTuple(args, "OO", &keys, &values))
        return 0;
    if(!PySequence_Check(keys)) {
        PyErr_SetString(PyExc_TypeError, "First arg not sequence");
        return 0;
    }
    if(!PySequence_Check(values)) {
        PyErr_SetString(PyExc_TypeError, "Second arg not sequence");
        return 0;
    }
    dict = PyDict_New();
    if(!dict)
        return 0;
    for(i=0; ;i++) {
        PyObject *key, *value;
        key = PySequence_GetItem(keys, i);
        if(!key) { PyErr_Clear(); break; }
        value = PySequence_GetItem(values, i);
        if(!value) { Py_DECREF(key); PyErr_Clear(); break; }
        rc = PyDict_SetItem(dict, key, value);
        Py_DECREF(key); Py_DECREF(value);
        if(rc<0) { Py_DECREF(dict); return 0; }
    }
    return dict;
}

static PyMethodDef apo_module_functions[] =
{
    { "dict", (PyCFunction)make_dict, METH_VARARGS },
    { 0, 0 }
};

void
initapo(void)
{
    Py_InitModule("apo", apo_module_functions);
}
::: end of file
with the obvious setup.py:
import sys
from distutils.core import setup, Extension

apo_ext = Extension('apo', sources=['apo.c'])
setup(name = "apo",
      ext_modules = [ apo_ext ]
      )
and here's the performance -- the same test as
before, with the obvious apopy() and aponu()
functions to compare updating an existing dict
vs creating a new one:
import apo

def apopy(d1, keys, vals):
    d1.update(apo.dict(keys, vals))

def aponu(d1, keys, vals):
    _ = apo.dict(keys, vals)
D:\Python21>python po.py
int(100000) keys and vals:
apopy: 0.366732820307
aponu: 0.265795159498
mappy: 1.43757364761
zippy: 1.66614313659
fancy: 0.456552311383
emile: 0.47589722272
xmile: 0.443106751527
str/hex (50000) keys and vals:
apopy: 0.300958277949
aponu: 0.21740606211
mappy: 0.701232502669
zippy: 0.923194449799
fancy: 0.298968640157
emile: 0.309439800466
xmile: 0.282484147431
hex/str (50000) keys and vals:
apopy: 0.3036234204
aponu: 0.215645224283
mappy: 0.677776734815
zippy: 0.899023786815
fancy: 0.300057325706
emile: 0.315924142335
xmile: 0.281894966568
Going to C from Python buys a pretty modest
benefit here, compared to what it gains in most cases.
Isn't it funny that the xmile you decry so
much turns out to be FASTER than a special
purpose C extension when updating an existing
dictionary with 50,000 string keys/values?
No doubt we'll be able to tune the C ext, if
we need that, by hardcoding the args as lists,
etc, etc, but I don't expect all that much
advantage here (left as an exercise for the
reader!-).
The key issue is avoiding the seriously wrong
approach (the 'obvious, functional' mappy and
zippy). Compared to what you can gain by such
avoidance, there are but small further potatoes
to be gained, even by pushing HARD.
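As an aside (a sketch assuming a Python version where the dict()
builtin accepts a sequence of key/value pairs -- a feature that landed
after 2.1), the whole task eventually collapses into one expression,
making much of this elegance debate moot:

```python
keys = ["a", "b", "c"]
vals = [1, 2, 3]

# dict() consuming (key, value) pairs makes pairing-up a one-liner
d = dict(zip(keys, vals))

assert d == {"a": 1, "b": 2, "c": 3}
```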
> Someone has written about this a few days ago. C/C++ programmers tend to
> avoid 'complex' data structures such as mappings when prototyping. The
> reason is that it is much harder to quickly write-compile-run a working
> prototype of these high level structures in a low level language. You end
> up having to write a lot of stuff just to make you prototype work. That's
> when many people switch to Python :-)
The "C/C++" generalization falls down as usual here. C++ can
be pretty high-level, and makes using a std::map a snap --
its problems are VERY different and can be summarized as HUGE
LANGUAGE COMPLEXITY (due in good part to stretching up to work
as a high-level language while keeping the low-level access
and speed, as it happens). C never really tries to be high
level, so prototyping is emphatically NOT its forte, of course.
> Take a look at some of the solutions proposed by Duncan Booth in a
> previous message:
>
> 1) for key, value in zip(keys,values): dict[key] = value
> 2) for key, value in map(None, keys,values): dict[key] = value
It does not appear to me that you have brought any arguments
worth taking a performance hit of hundreds of percent on the
altar of the "elegance" of these approaches wrt a loop that
"anybody from the street" will understand.
> 3) map(dict.setdefault, keys, values)
I was the one first proposing this on this thread, I believe.
Originally I thought its speed justified its excessive fanciness.
Having now better measured what a plain loop with xrange can
do, I vote for the plain loop. Fancy approaches are only
justified when they bring real advantages of either speed,
or elegance and handiness. List comprehensions can be very
elegant and handy indeed, so, for me, they easily "carry
their weight". This particular idiom is not really elegant
AND turns out to be roughly comparable to, and (on my box) a
bit slower than, the plain, obvious solution. Therefore
it qualifies as misplaced fanciness in my opinion.
> The first two still use an explicit 'for' loop. However, we avoid
> creating an intermediate index list, that truly had nothing to do with the
...in favour of creating ANOTHER intermediate list (of tuples)
which has just about as little "to do with the original problem".
> original problem. Both the zip() or the map() calls are being used for the
> same reason: to 'pair up' all elements of both lists.
In other words, to create an intermediate tuples list instead
of an intermediate index list (or, using xrange, an intermediate
"index"-object). Six of one, half a dozen of the other.
> The third option is the simplest one. It is written in a functional
> fashion, and it is pretty simple: map all keys to a value, using the
Actually: map all PAIRS of keys and values.
> setdefault call. My *only* remark here has to do with the 'setdefault'
> method name, but that's another issue <wink>
And NO remark about the "intermediate list" (as it happens, a copy
of the list of values, assuming no duplicates in the list of keys)
that we're creating AND silently throwing away? Doesn't THAT one
"truly have nothing to do" etc?-)
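The duplicates caveat is easy to check: with a repeated key,
setdefault returns the value already stored, so the throwaway list is
no longer a plain copy of the values (a small sketch; list() forces
the map where it is lazy):

```python
d = {}
keys = ["a", "b", "a"]
vals = [1, 2, 3]

result = list(map(d.setdefault, keys, vals))

assert d == {"a": 1, "b": 2}   # the second "a" does not overwrite
assert result == [1, 2, 1]     # the existing value comes back, not 3
```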
Alex