Slowdown while creating a big list and iterating over it

marc magrans de abril marcmagransdeabril at gmail.com
Sat Jan 30 15:36:59 EST 2010


Dear colleagues,

I was writing a small program to classify log files for a cluster of
PCs. I just wanted to simplify a rather repetitive task of looking
for errors and the like.

My first naive implementation was something like:
    patterns = []
    while logs:
        # Use the first remaining log line as the pattern for this pass.
        pattern = logs[0]
        # Keep only the lines that are far (in edit distance) from the pattern.
        new_logs = [l for l in logs if dist(pattern, l) > THRESHOLD]
        # Record how many lines this pattern absorbed, then repeat on the rest.
        entry = (len(logs) - len(new_logs), pattern)
        patterns.append(entry)
        logs = new_logs

Here dist(...) is the Levenshtein distance (i.e. edit distance) and
logs holds something like 1.5M log lines (a 700 MB file). I thought
Python would be an easy choice, although not a really fast one.
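
For concreteness, dist could be the standard two-row
dynamic-programming Levenshtein; my real code may differ in details,
but it is something along these lines:

    def dist(a, b):
        # Levenshtein (edit) distance between strings a and b,
        # keeping only two rows of the DP table at a time.
        if len(a) < len(b):
            a, b = b, a  # keep the shorter string in b to save memory
        previous = range(len(b) + 1)
        for i, ca in enumerate(a):
            current = [i + 1]
            for j, cb in enumerate(b):
                insert = current[j] + 1
                delete = previous[j + 1] + 1
                substitute = previous[j] + (ca != cb)
                current.append(min(insert, delete, substitute))
            previous = current
        return previous[-1]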

I was not surprised when the first iteration of the while loop took
~10 min. I thought "not bad, let's see how long the whole thing
takes". However, the second iteration seemed to never finish.

My surprise was big when I replaced the list comprehension with an
explicit loop and a print:

    new_logs = []
    for count, l in enumerate(logs):
        # Print the index so I can watch how fast the loop advances.
        print count
        if dist(pattern, l) > THRESHOLD:
            new_logs.append(l)

The displayed counter was running ~10 times slower during the second
iteration of the while loop than during the first.
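
A cheaper way to quantify this, reusing the same names as above,
would be to time each whole pass instead of printing every line:

    import time

    patterns = []
    while logs:
        start = time.time()
        pattern = logs[0]
        new_logs = [l for l in logs if dist(pattern, l) > THRESHOLD]
        patterns.append((len(logs) - len(new_logs), pattern))
        logs = new_logs
        # One line per pass: elapsed seconds and how many logs remain.
        print "pass: %.1f s, %d logs left" % (time.time() - start, len(logs))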

I am a little lost. Does anyone know the reason for this behavior?
How should I write a program that deals with large data sets in Python?

Thanks a lot!
marc magrans de abril


