Efficient processing of large numeric data file

Fri Jan 18 13:06:56 EST 2008

> for line in file:

The first thing I would try is just doing a

  for line in file:
    pass

to see how much time is consumed merely by iterating over the
file.  This should give you a baseline against which to compare
your other timings.
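
For what it's worth, here is a rough sketch of how that baseline
might be measured.  The helper name time_bare_iteration and the
temporary sample file are just mine for illustration (substitute
your real file), and I count the lines instead of a bare "pass"
only so there is something to sanity-check afterwards.

```python
import os
import tempfile
import time


def time_bare_iteration(path):
    """Return (seconds, line_count) for a loop that only reads lines."""
    start = time.perf_counter()
    line_count = 0
    with open(path) as f:
        for line in f:
            line_count += 1  # no parsing at all, just iteration
    return time.perf_counter() - start, line_count


# Hypothetical sample data in the "value [count]" layout from the question.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("3 2\n7\n3\n" * 1000)
    sample_path = tmp.name

elapsed, line_count = time_bare_iteration(sample_path)
os.remove(sample_path)
print("read %d lines in %.6f seconds" % (line_count, elapsed))
```

Whatever the full loop takes beyond this number is what the
parsing and histogram updates are costing you.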

> 	data = line.split()
> 	first = int(data[0])
> 
> 	if len(data) == 1:
> 		count = 1
> 	else:
> 		count = int(data[1])    # more than one repetition

Well, some experiments I might try:

  try:
    first, count = map(int, data)
  except ValueError:  # only one item on the line
    first = int(data[0])
    count = 1

or possibly

  first = int(data[0])
  try:
    count = int(data[1])
  except IndexError:  # no repetition count given
    count = 1

or even

  # pad it to contain at least two items
  # then slice off the first two
  # and then map() calls to int()
  first, count = map(int,(data + [1])[:2])

As far as I know, len() on a list is cheap (CPython stores the
length with the list rather than counting the items each time),
but I don't know how its cost compares to try/except blocks,
map() or int() calls.

I'm not sure any of them is more or less "pythonic", but they
should all do the same thing.
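
If you want to measure rather than guess, something like the
following sketch would compare the variants with timeit.  The
function names and the two sample lines are my own inventions,
and I'm assuming a missing count means 1, as in your original
code.  The numbers will vary by machine and by how often the
one-item case shows up in your data.

```python
import timeit


def parse_len(data):
    # the original approach: check the length first
    first = int(data[0])
    count = 1 if len(data) == 1 else int(data[1])
    return first, count


def parse_try_map(data):
    # unpacking fails with ValueError when there is only one item
    try:
        first, count = map(int, data)
    except ValueError:
        first, count = int(data[0]), 1
    return first, count


def parse_try_index(data):
    first = int(data[0])
    try:
        count = int(data[1])
    except IndexError:
        count = 1
    return first, count


def parse_pad(data):
    # pad to at least two items, keep the first two, convert both
    first, count = map(int, (data + [1])[:2])
    return first, count


samples = [["3", "2"], ["7"]]  # one line with a count, one without
results = {}
for fn in (parse_len, parse_try_map, parse_try_index, parse_pad):
    results[fn.__name__] = [fn(s) for s in samples]
    seconds = timeit.timeit(lambda: [fn(s) for s in samples], number=20000)
    print("%-16s %.4f s" % (fn.__name__, seconds))
```

Exceptions tend to be cheap when they are not raised and
comparatively expensive when they are, so the mix of one-item
and two-item lines in the real file matters.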

> 	if first in hist:       # add the information to the histogram
> 		hist[first]+=count
> 	else:
> 		hist[first]=count

This might also be written as

  hist[first] = hist.get(first, 0) + count
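
Here is a small sketch of the whole loop using that form, run on
a made-up list of lines rather than your file.  I've also
included a collections.defaultdict variant, which is not
something from your code, just another option that might be
worth timing.

```python
from collections import defaultdict


def parse(line):
    # "value [count]"; a missing count is treated as 1
    data = line.split()
    first = int(data[0])
    count = int(data[1]) if len(data) > 1 else 1
    return first, count


def build_hist_get(lines):
    hist = {}
    for line in lines:
        first, count = parse(line)
        hist[first] = hist.get(first, 0) + count
    return hist


def build_hist_defaultdict(lines):
    hist = defaultdict(int)  # missing keys start at 0
    for line in lines:
        first, count = parse(line)
        hist[first] += count
    return dict(hist)


sample_lines = ["3 2", "7", "3"]  # hypothetical input
hist_a = build_hist_get(sample_lines)
hist_b = build_hist_defaultdict(sample_lines)
print(sorted(hist_a.items()))
```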

> Is a dictionary the right way to do this?  In any given file, there is
> an upper bound on the data, so it seems to me that some kind of array
> (numpy?) would be more efficient, but the upper bound changes in each
> file.

I'm not sure an array would net you great savings here, since the
upper bound seems to be an unknown.  If "first" has a known
maximum (surely, the program generating this file has an idea of
the range of allowed values), you could just create an array the
length of the span of numbers, initialized to zero, which would
reduce the hist.get() call to just

  hist[first] += count

and then you'd iterate over hist (which would already be sorted
because it's in index order) and use those where count != 0 to
avoid the holes.
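
A minimal sketch of that idea with a plain list, assuming the
values are non-negative integers no larger than some max_value
you know in advance (the function name, max_value and the
sample lines are all mine, for illustration only):

```python
def build_hist_list(lines, max_value):
    # one slot per possible value, 0 through max_value inclusive
    hist = [0] * (max_value + 1)
    for line in lines:
        data = line.split()
        first = int(data[0])
        count = int(data[1]) if len(data) > 1 else 1
        hist[first] += count  # no membership test or .get() needed
    return hist


hist = build_hist_list(["3 2", "7", "3"], max_value=8)

# Already in index order; skip the holes where nothing was counted.
nonzero = [(value, count) for value, count in enumerate(hist) if count != 0]
print(nonzero)
```

If the range is huge and mostly empty, the list wastes memory on
the holes, and the dictionary may still be the better fit.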

Otherwise, your code looks good...the above are just riffs on
various ways of rewriting it, in case one of them nets you some
time savings per loop.

-tkc