Efficient processing of large numeric data files

Paul Rubin
Fri Jan 18 12:58:57 EST 2008


David Sanders <dpsanders at gmail.com> writes:
> The data files are large (~100 million lines), and this code takes a
> long time to run (compared to just doing wc -l, for example).

wc is written in carefully optimized C and will almost certainly
run faster than any python program.

> Am I doing something very inefficient?  (Any general comments on my
> pythonic (or otherwise) style are also appreciated!)  Is
> "line.split()" efficient, for example?

Your implementation's efficiency is not too bad.  Stylistically it's
not quite fluent but there's nothing to really criticize--you may
develop a more concise style with experience, or maybe not.
One small optimization you could make is to use collections.defaultdict
to hold the counters instead of a regular dict, so you can get rid of
the test for whether a key is in the dict.  
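
Roughly like this, say (the file name and the choice of the first
whitespace-separated field as the key are just guesses, since your
code isn't quoted above):

    from collections import defaultdict

    counts = defaultdict(int)      # missing keys start at 0
    f = open("data.txt")           # hypothetical input file
    for line in f:
        key = line.split()[0]      # assumes the key is the first field
        counts[key] += 1
    f.close()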

Keep an eye on your program's memory consumption as it runs.  The
overhead of a pair of python ints and a dictionary cell to hold them
is some dozens of bytes at minimum.  If you have a lot of distinct
keys and not enough memory to hold them all in the large dict, your
system may be thrashing.  If that is happening, the two basic
solutions are 1) buy more memory, or 2) divide the input into smaller
pieces, attack them separately, and merge the results.
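
Option 2 might look something like the following sketch, with made-up
helper names, where each piece is just a chunk of lines (produced with
the unix "split" utility, or by reading a fixed number of lines at a
time):

    from collections import defaultdict

    def count_chunk(lines):
        """Count keys in one manageable slice of the input."""
        counts = defaultdict(int)
        for line in lines:
            counts[line.split()[0]] += 1
        return counts

    def merge_counts(partials):
        """Fold several partial count dicts into one total."""
        total = defaultdict(int)
        for partial in partials:
            for key, n in partial.items():
                total[key] += n
        return total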

If I were writing this program and didn't have to run it too often,
I'd probably use the unix "sort" utility to sort the input (that
utility does an external disk sort if the input is large enough to
require it), then make a single pass over the sorted list to count up
each group of keys (see itertools.groupby for a convenient way to do
that).
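
Something like this, assuming the sorted keys arrive one per line on
stdin (e.g. piped in from "sort datafile"):

    import itertools
    import sys

    # groupby batches consecutive identical lines, so one pass is enough
    keys = (line.strip() for line in sys.stdin)
    for key, group in itertools.groupby(keys):
        print("%s %d" % (key, sum(1 for _ in group)))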


