Python 3.0 - is this true?

"Martin v. Löwis" martin at v.loewis.de
Mon Nov 24 17:19:08 EST 2008


> Sorry for the delay in replying. 

Thanks, and I'm sorry in turn for my own delay. Here is my experiment:

I only looked at the first test case, where the list built from
bigfile.txt occasionally contains None instead of a line. To run this
test, I created a file of 33079296 bytes and 704512 lines by
repeatedly cat'ing /etc/passwd.
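
For reference, here is a hedged sketch of how such a test file could be
built in Python; it is not the exact command I used (that was repeated
cat'ing), and the file name "x" is simply what the script below opens:

# Hypothetical helper: grow a test file to roughly 32 MiB by
# appending /etc/passwd over and over.
src = open("/etc/passwd", "rb").read()
with open("x", "wb") as out:
    while out.tell() < 32 * 1024 * 1024:   # stop once ~32 MiB are written
        out.write(src)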

I then ran this program:

import time, random
l = []

# Read the test file, replacing roughly 4% of the lines with None.
t = time.time()
for line in open("x"):
    x = random.randint(0, 100)
    if x < 4: l.append(None)
    else: l.append(line)
print(time.time()-t)

# Sort ten fresh copies of the list without a key.
t = time.time()
for i in range(10):
    L = l[:]
    L.sort()
print(time.time()-t)

# None sorts as the empty string, so mixed None/str comparisons never happen.
def key(n):
    return n or ""

# Sort ten fresh copies again, this time with the key.
t = time.time()
for i in range(10):
    L = l[:]
    L.sort(key=key)
print(time.time()-t)

This key function is based on the assumption that
there won't be any empty strings in l, which is reasonable,
because they are all lines read from the input file
(i.e. each contains at least a \n). If truly empty strings
were possible, I'd map them to chr(0), which I would
assume couldn't occur elsewhere in the input.
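
To make the failure mode concrete, here is a minimal sketch (with made-up
three-line data, not the real bigfile) of why the key is needed in 3.0:
comparing None with a str raises TypeError, and the key avoids the mixed
comparison entirely:

data = ["root:x:0:0\n", None, "daemon:x:1:1\n"]

try:
    sorted(data)              # 3.0: TypeError, None and str don't compare
except TypeError as e:
    print(e)

def key(n):
    return n or ""            # None sorts like the empty string, i.e. first

print(sorted(data, key=key))  # [None, 'daemon:x:1:1\n', 'root:x:0:0\n']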

With 2.5, I get these results (rounded):

 3.8s   (reading the file)
 7.6s   (sorting without a key)
14.3s   (sorting with the key)

So with the key function, the sorting time roughly doubles.
I think that ratio should decrease as lists grow larger (because
the key function is linear and the sort is O(n log n), assuming the
overhead is in computing the key function), but I haven't done
any series varying the file size.
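
As a back-of-the-envelope check of that expectation (my own arithmetic,
not a measurement): the key calls cost on the order of n, the
comparisons on the order of n*log(n), so the relative overhead should
shrink roughly like 1/log(n):

import math

# Rough model: relative key overhead ~ n / (n * log2(n)) = 1 / log2(n)
for n in (10**5, 10**6, 10**7):
    print(n, round(1 / math.log(n, 2), 3))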

With 3.0 (rc3), I get (omitting the key-less sorting, as it won't
work; None and str no longer compare):

24.8s   (reading the file)
15.9s   (sorting with the key)

So now the IO time is much larger than the actual sorting time, which
is about the same in 2.5 and 3.0. If you wonder where the IO overhead
comes from: part of it probably stems from the UTF-8 decoding that the
IO library now does transparently.
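
If that decoding turns out to matter, and assuming the data is plain
ASCII, two standard ways to sidestep it would be reading in binary mode
or with a one-byte encoding. This is a sketch, not something I measured
in the runs above:

import time

t = time.time()
lines = list(open("x", "rb"))                # bytes lines, no decoding at all
print("binary read:", time.time() - t)

t = time.time()
lines = list(open("x", encoding="latin-1"))  # str lines, trivial 1:1 decode
print("latin-1 read:", time.time() - t)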

I don't know whether this overhead qualifies as "not insignificant";
most people would probably claim that a doubling in time is indeed
significant. OTOH, not knowing what the original problem is (e.g.
whether it's interactive or runs in batch mode, and whether bigfile.txt
is much smaller or much larger than 32 MiB), I would probably
a) evaluate whether the increase in IO time is acceptable, and
b) if it is, just go with 3.0, as the increase in sorting time
   is much smaller than the increase in IO time.

Regards,
Martin
