Generator slower than iterator?

Raymond Hettinger python at rcn.com
Fri Dec 19 06:31:07 EST 2008


> Federico Moreira wrote:
> > Hi all,
>
> > I'm parsing a 4.1GB apache log to have stats about how many times an IP
> > requests something from the server.
>
> > The first design of the algorithm was
>
> > for line in fileinput.input(sys.argv[1:]):
> >     ip = line.split()[0]
> >     if match_counter.has_key(ip):
> >         match_counter[ip] += 1
> >     else:
> >         match_counter[ip] = 1
 . . .
> > Should i leave fileinput behind?

Yes.  fileinput is slow because it does a lot more than just read
files.

> > Am I using generators with the wrong approach?

No need for a generator here.  The time is being lost with fileinput,
split, and the counting code.  Try this instead:

import collections
import sys

match_counter = collections.defaultdict(int)
for filename in sys.argv[1:]:
    for line in open(filename):
        # partition() is faster than split() when only the first field is needed
        ip, sep, rest = line.partition(' ')
        match_counter[ip] += 1
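In newer Pythons (2.7 and 3.1 onward), collections.Counter does the same tallying in one step. A minimal sketch, using a few made-up log lines in place of the real Apache log:

```python
import collections

# Hypothetical sample lines standing in for the Apache log.
lines = [
    "10.0.0.1 - - [19/Dec/2008] GET /index.html",
    "10.0.0.2 - - [19/Dec/2008] GET /about.html",
    "10.0.0.1 - - [19/Dec/2008] GET /logo.png",
]

# Counter tallies the first whitespace-delimited field (the client IP).
match_counter = collections.Counter(line.partition(' ')[0] for line in lines)

# most_common() lists the heaviest requesters first.
for ip, count in match_counter.most_common():
    print(ip, count)
```

Counter also spares you the defaultdict setup and the explicit += 1 in the loop body.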

If you're on *nix, there's a fast command-line approach:

    cut -d' ' -f1 filelist | sort | uniq -c

More information about the Python-list mailing list