Slow down while creating a big list and iterating over it

Alf P. Steinbach alfps at start.no
Sat Jan 30 17:08:56 EST 2010


* marc magrans de abril:
> Dear colleagues,
> 
> I was writing a small program to classify log files for a cluster of
> PCs; I just wanted to simplify a rather repetitive task of finding
> errors and the like.
> 
> My first naive implementation was something like:
>     patterns = []
>     while logs:
>         pattern = logs[0]
>         new_logs = [l for l in logs if dist(pattern, l) > THRESHOLD]
>         entry = (len(logs) - len(new_logs), pattern)
>         patterns.append(entry)
>         logs = new_logs
> 
> Where dist(...) is the Levenshtein distance (i.e. edit distance) and
> logs is something like 1.5M log lines (a 700 MB file). I thought that
> Python would be an easy choice, although not a really fast one.
> 
> I was not surprised when the first iteration of the while loop took
> ~10 min. I thought "not bad, let's see how long it takes". However, it
> seemed that the second iteration never finished.
> 
> My surprise was big when I replaced the list comprehension with an
> explicit loop and a print:
> new_logs = []
> for count, l in enumerate(logs):
>     print count
>     if dist(pattern, l) > THRESHOLD:
>         new_logs.append(l)
> 
> The surprise was that the displayed counter ran ~10 times slower on
> the second iteration of the while loop.
> 
> I am a little lost. Does anyone know the reason for this behavior?

It's on line 42 of your program. :-) That is, it's in the dist function.
Levenshtein distance is typically computed with a dynamic program whose
cost is proportional to the product of the two string lengths, so every
call gets slower as 'pattern' gets longer. Evidently the second iteration
picked a more complex 'pattern'.
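
For illustration (assuming dist is the usual dynamic-programming
implementation; your post doesn't show it), a minimal version looks like
this, and its cost is the product of the two string lengths:

    def levenshtein(a, b):
        # Classic DP: fills a (len(a)+1) x (len(b)+1) table row by row,
        # so each call costs O(len(a) * len(b)) time.
        if len(a) < len(b):
            a, b = b, a                     # keep the inner row short
        previous = list(range(len(b) + 1))  # distances from the empty prefix
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                current.append(min(previous[j] + 1,                # deletion
                                   current[j - 1] + 1,             # insertion
                                   previous[j - 1] + (ca != cb)))  # substitution
            previous = current
        return previous[-1]

If the pattern picked on the second round happens to be, say, ten times
longer than the first one, every comparison against the remaining lines is
roughly ten times slower, which would match the counter you observed.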


> How should I write a program that deals with large data sets in Python?

As in any other language. Try to avoid repeating the same computations,
and try to make the data fit the computational task.
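
For example (a sketch, not your actual code; it assumes dist is a true
edit distance, for which the length difference of two strings is a lower
bound on their distance), you can skip the expensive call entirely
whenever the lengths alone already settle the comparison:

    def classify(logs, dist, THRESHOLD):
        # Same loop as in the original post, plus a cheap pre-test: if the
        # lengths differ by more than THRESHOLD, the edit distance is
        # necessarily > THRESHOLD, so dist() need not be called at all.
        patterns = []
        while logs:
            pattern = logs[0]
            plen = len(pattern)
            new_logs = [l for l in logs
                        if abs(len(l) - plen) > THRESHOLD  # cheap lower bound
                        or dist(pattern, l) > THRESHOLD]   # expensive check
            patterns.append((len(logs) - len(new_logs), pattern))
            logs = new_logs
        return patterns

On log data, where lines produced by the same event tend to have similar
lengths, that cheap test alone can avoid most of the dist() calls.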


Cheers & hth.,

- Alf


