Efficient processing of large numeric data files
Matimus
mccredie at gmail.com
Fri Jan 18 12:55:56 EST 2008
On Jan 18, 9:15 am, David Sanders <dpsand... at gmail.com> wrote:
> Hi,
>
> I am processing large files of numerical data. Each line is either a
> single (positive) integer, or a pair of positive integers, where the
> second represents the number of times that the first number is
> repeated in the data -- this is to avoid generating huge raw files,
> since one particular number is often repeated in the data generation
> step.
>
> My question is how to process such files efficiently to obtain a
> frequency histogram of the data (how many times each number occurs in
> the data, taking into account the repetitions). My current code is as
> follows:
>
> -------------------
> #!/usr/bin/env python
> # Counts the occurrences of integers in a file and makes a histogram of them
> # Allows for a second field which gives the number of counts of each datum
>
> import sys
> args = sys.argv
> num_args = len(args)
>
> if num_args < 2:
>     print "Usage: count.py filename"
>     sys.exit()
>
> name = args[1]
> file = open(name, "r")
>
> hist = {}  # dictionary for histogram
> num = 0
>
> for line in file:
>     data = line.split()
>     first = int(data[0])
>
>     if len(data) == 1:
>         count = 1
>     else:
>         count = int(data[1])  # more than one repetition
>
>     if first in hist:  # add the information to the histogram
>         hist[first] += count
>     else:
>         hist[first] = count
>
>     num += count
>
> keys = hist.keys()
> keys.sort()
>
> print "# i fraction hist[i]"
> for i in keys:
>     print i, float(hist[i])/num, hist[i]
> ---------------------
>
> The data files are large (~100 million lines), and this code takes a
> long time to run (compared to just doing wc -l, for example).
>
> Am I doing something very inefficient? (Any general comments on my
> pythonic (or otherwise) style are also appreciated!) Is
> "line.split()" efficient, for example?
>
> Is a dictionary the right way to do this? In any given file, there is
> an upper bound on the data, so it seems to me that some kind of array
> (numpy?) would be more efficient, but the upper bound changes in each
> file.
My first suggestion is to wrap your code in a function. Functions run
much faster in Python than module-level code (local variable lookups
are much cheaper than global ones), so that will give you a speed-up
right away. My second suggestion is to look into using defaultdict for
your histogram; a dictionary is a very appropriate way to store this
data. There has been some mention of a bag type, which would do
exactly what you need, but unfortunately there is no built-in bag type
(yet). I would write it something like this:
from collections import defaultdict

def get_hist(file_name):
    hist = defaultdict(int)
    f = open(file_name, "r")
    for line in f:
        vals = line.split()
        val = int(vals[0])
        try:
            # don't look to see if you will cause an error,
            # just cause it and then deal with it
            cnt = int(vals[1])
        except IndexError:
            cnt = 1
        hist[val] += cnt
    f.close()
    return hist
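As an aside: later Python versions (2.7 / 3.1 and up) added exactly that
bag type as collections.Counter. A minimal sketch of the same histogram
using it, in modern Python (the helper name here is my own, not from
either script above):

```python
from collections import Counter

def get_hist_counter(lines):
    """Build a histogram from an iterable of 'value' or 'value count' lines."""
    hist = Counter()
    for line in lines:
        vals = line.split()
        # a bare value counts once; an optional second field is its repeat count
        cnt = int(vals[1]) if len(vals) > 1 else 1
        hist[int(vals[0])] += cnt
    return hist

# Counter is a dict subclass, so the rest of the script works unchanged,
# and most_common() gives the entries sorted by frequency for free:
# get_hist_counter(["3 5", "7", "3"]).most_common()  ->  [(3, 6), (7, 1)]
```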
HTH
Matt