Efficient processing of large numeric data files
Matimus
mccredie at gmail.com
Fri Jan 18 12:55:56 EST 2008
On Jan 18, 9:15 am, David Sanders <dpsand... at gmail.com> wrote:
> Hi,
>
> I am processing large files of numerical data. Each line is either a
> single (positive) integer, or a pair of positive integers, where the
> second represents the number of times that the first number is
> repeated in the data -- this is to avoid generating huge raw files,
> since one particular number is often repeated in the data generation
> step.
>
> My question is how to process such files efficiently to obtain a
> frequency histogram of the data (how many times each number occurs in
> the data, taking into account the repetitions). My current code is as
> follows:
>
> -------------------
> #!/usr/bin/env python
> # Counts the occurrences of integers in a file and makes a histogram of them
> # Allows for a second field which gives the number of counts of each datum
>
> import sys
> args = sys.argv
> num_args = len(args)
>
> if num_args < 2:
>     print "Usage: count.py filename"
>     sys.exit()
>
> name = args[1]
> file = open(name, "r")
>
> hist = {}  # dictionary for histogram
> num = 0
>
> for line in file:
>     data = line.split()
>     first = int(data[0])
>
>     if len(data) == 1:
>         count = 1
>     else:
>         count = int(data[1])  # more than one repetition
>
>     if first in hist:  # add the information to the histogram
>         hist[first] += count
>     else:
>         hist[first] = count
>
>     num += count
>
> keys = hist.keys()
> keys.sort()
>
> print "# i fraction hist[i]"
> for i in keys:
>     print i, float(hist[i])/num, hist[i]
> ---------------------
>
> The data files are large (~100 million lines), and this code takes a
> long time to run (compared to just doing wc -l, for example).
>
> Am I doing something very inefficient? (Any general comments on my
> pythonic (or otherwise) style are also appreciated!) Is
> "line.split()" efficient, for example?
>
> Is a dictionary the right way to do this? In any given file, there is
> an upper bound on the data, so it seems to me that some kind of array
> (numpy?) would be more efficient, but the upper bound changes in each
> file.
My first suggestion is to wrap your code in a function. Functions run
much faster in Python than module-level code (local variable lookups
are much cheaper than global ones), so that will give you a speed-up
right away. My second suggestion is to look into using defaultdict for
your histogram; a dictionary is a very appropriate way to store this
data. There has been some mention of a bag type, which would do
exactly what you need, but unfortunately there is no built-in bag type
(yet). I would write it something like this:
from collections import defaultdict

def get_hist(file_name):
    hist = defaultdict(int)
    f = open(file_name, "r")
    for line in f:
        vals = line.split()
        val = int(vals[0])
        try:
            # don't look to see if you will cause an error,
            # just cause it and then deal with it
            cnt = int(vals[1])
        except IndexError:
            cnt = 1
        hist[val] += cnt
    f.close()
    return hist
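As an aside: later Python versions (2.7 / 3.1 and up) added exactly that
bag type as collections.Counter. A minimal sketch of the same histogram
using it, in modern Python (the helper name here is my own, not from
either script above):

```python
from collections import Counter

def get_hist_counter(lines):
    """Build a histogram from an iterable of 'value' or 'value count' lines."""
    hist = Counter()
    for line in lines:
        vals = line.split()
        # a bare value counts once; an optional second field is its repeat count
        cnt = int(vals[1]) if len(vals) > 1 else 1
        hist[int(vals[0])] += cnt
    return hist

# Counter is a dict subclass, so the rest of the script works unchanged,
# and most_common() gives the entries sorted by frequency for free:
# get_hist_counter(["3 5", "7", "3"]).most_common()  ->  [(3, 6), (7, 1)]
```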
HTH
Matt