Question about optimization

Wei Hao weihao89 at gmail.com
Thu Jul 24 17:19:41 EDT 2008


Hi:

I'm pretty new to Python and I'm having a performance problem. Below is the
piece of code that is causing it, preceded by pseudo-code and annotated with
comments. I'm accessing a gigantic table (about 15 million rows) in SQL.


d is some dictionary, r is a precompiled regex string
Big loop, so I search through the table in chunks given by delta
    SQL query ("select * from table where rowID >= n and rowID < (n + delta)"),
    result of query stored in a. Each individual row is a[n1], columns
    of rows are a[n1][n2].

    t1 = time.clock() #to track speed
    for m in a:
        for temp in m:
            #basically skip over the columns that are null for this particular row
            if str(temp) == "None":
                continue
            s += temp #get the columns into one long string
        #these words cause problems, need to get rid of them.
        s = s.replace("between", "")
        s = s.replace("and", "")
        s = s.replace("where", "")
        s = s.replace("like", "")
        #looking for the stuff I want, always at least one per row of table,
        #about 3-4 on average.
        b = re.findall(r,s)
        for t in b: #store count of things I want in dictionary
            if t in d:
                d[t] += 1
            else:
                d[t] = 1
    print n, (time.clock()-t1) #to track speed
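
For reference, here is a self-contained sketch of the per-chunk loop above
that runs without the database. The sample rows, the pattern `\d+`, and the
function name `count_matches` are made up for illustration. The sketch also
assumes that `s` is meant to start empty for each row, which the snippet
above does not show. It is written for Python 3, so it uses `print()` as a
function.

```python
import re
from collections import Counter

# Hypothetical stand-ins: the real pattern and table are not shown in the post.
r = re.compile(r"\d+")
STOP_WORDS = ("between", "and", "where", "like")


def count_matches(rows, counts):
    """Update counts with every regex match found in each row."""
    for row in rows:
        # Join the non-null columns of this row into one string.
        # Assumption: the string starts empty for every row.
        s = "".join(str(col) for col in row if col is not None)
        for word in STOP_WORDS:
            s = s.replace(word, "")
        # Counter.update does the "d[t] += 1 or d[t] = 1" bookkeeping.
        counts.update(r.findall(s))
    return counts


d = Counter()
sample = [("id 12 and 7", None, "like 12"), ("where 7", "x", None)]
count_matches(sample, d)
print(d)
```

`Counter` is used here only as a shorthand for the if/else dictionary
update. A plain dict works the same way.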


I am 100% sure it's this code snippet that's the cause of my problems.
Here's what I can tell you. Each chunk of rows that I grab is essentially
equal in size (rowID skips over stuff, but rather arbitrarily). The time it
takes to fetch the SQL query doesn't change. But as the program progresses,
this snippet gets slower. Here's the output:

2500 0.441551299341
5000 1.26162739664
7500 2.35092688403
10000 3.48417469666
12500 4.59031305491
15000 5.78972588775
17500 6.28305527139
20000 6.73344570903
22500 8.31732146487
25000 9.65322872159
27500 8.98186042757
30000 11.8042818095
32500 12.1965593712
35000 13.2735763291
37500 14.0282617344

What is it in the code snippet that slows down as n increases? Is there
something about the way low-level Python functions work that I don't
understand and that is slowing me down?
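
One thing that might be worth checking, offered as a sketch rather than a
diagnosis: the snippet never shows `s` being reset. If it is only
initialized once, every `replace` and `re.findall` call works on a longer
string than the one before. The example below counts how many characters
`re.findall` has to scan in each case. The rows, the pattern, and the
function name `scanned_chars` are hypothetical.

```python
import re

r = re.compile(r"\d+")  # hypothetical pattern, not the one from the post


def scanned_chars(rows, reset_each_row):
    """Return the total number of characters handed to re.findall."""
    s = ""
    total = 0
    for row in rows:
        if reset_each_row:
            s = ""  # start a fresh string for every row
        for col in row:
            if col is not None:
                s += col
        r.findall(s)
        total += len(s)
    return total


rows = [("abc 12", None, "de 7")] * 1000  # made-up data, 10 characters per row
growing = scanned_chars(rows, reset_each_row=False)
fresh = scanned_chars(rows, reset_each_row=True)
print(growing, fresh)
```

With the string reset per row the total grows linearly with the number of
rows. Without the reset it grows quadratically, which would show up as each
chunk taking longer than the previous one.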

Thanks in advance for your time.

-Wei