Generator slower than iterator?
Raymond Hettinger
python at rcn.com
Fri Dec 19 06:31:07 EST 2008
> Federico Moreira wrote:
> > Hi all,
>
> > I'm parsing a 4.1GB Apache log to get stats on how many times an IP
> > requests something from the server.
>
> > The first design of the algorithm was
>
> > for line in fileinput.input(sys.argv[1:]):
> >     ip = line.split()[0]
> >     if match_counter.has_key(ip):
> >         match_counter[ip] += 1
> >     else:
> >         match_counter[ip] = 1
. . .
> > Should I leave fileinput behind?
Yes. fileinput is slow because it does a lot more than just read
files.
> > Am I using generators with the wrong approach?
No need for a generator here. The time is being lost in fileinput,
split, and the counting code. Try this instead:
import sys
import collections

match_counter = collections.defaultdict(int)
for filename in sys.argv[1:]:
    for line in open(filename):
        ip, sep, rest = line.partition(' ')
        match_counter[ip] += 1
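As a variant of the same idea, the counting can also be done with
collections.Counter (available since Python 2.7). A minimal sketch,
using a few hypothetical log lines in place of a real file:

```python
from collections import Counter

# Hypothetical Apache-style log lines; the first whitespace-separated
# field is the client IP, just as in the original split()[0] approach.
lines = [
    "1.2.3.4 - - [19/Dec/2008] GET /index.html",
    "1.2.3.4 - - [19/Dec/2008] GET /about.html",
    "5.6.7.8 - - [19/Dec/2008] GET /index.html",
]

# partition(' ') splits off only the first field, so the rest of the
# line is never split -- cheaper than split() on long lines.
match_counter = Counter(line.partition(' ')[0] for line in lines)

print(match_counter.most_common())
# [('1.2.3.4', 2), ('5.6.7.8', 1)]
```

Counter.most_common() also gives the per-IP totals already sorted by
request count, which is usually what a stats report wants.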
If you're on *nix, there's a fast command line approach:
cut -d' ' -f1 filelist | sort | uniq -c
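For a concrete run of that pipeline, here is a sketch against a small
hypothetical log file (access.log and its contents are made up for
illustration):

```shell
# Three hypothetical log lines; the first field is the client IP.
printf '1.2.3.4 GET /a\n1.2.3.4 GET /b\n5.6.7.8 GET /a\n' > access.log

# cut keeps only the IP field, sort groups identical IPs together
# so that uniq -c can count each run; sort -rn puts the busiest
# IPs first.
cut -d' ' -f1 access.log | sort | uniq -c | sort -rn
#       2 1.2.3.4
#       1 5.6.7.8
```

The trailing sort -rn is optional; it just orders the output by hit
count instead of by IP.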