Efficient processing of large numeric data file
Tim Chase
python.list at tim.thechases.com
Fri Jan 18 13:06:56 EST 2008
> for line in file:
The first thing I would try is just doing a
    for line in file:
        pass
to see how much time is consumed merely by iterating over the
file. This should give you a baseline against which to compare
your other timings.
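A minimal sketch of that baseline measurement follows. The function name time_bare_iteration and the path argument are made up for illustration, and it counts lines instead of using a bare "pass" only so there is something to sanity-check afterwards.

```python
import time


def time_bare_iteration(path):
    """Return (elapsed_seconds, line_count) for a loop that only reads lines.

    Hypothetical helper: no splitting, no int(), no dictionary work.
    """
    start = time.perf_counter()
    n_lines = 0
    with open(path) as f:
        for line in f:
            n_lines += 1  # near-trivial work, standing in for "pass"
    elapsed = time.perf_counter() - start
    return elapsed, n_lines
```

Whatever time the full histogram loop takes beyond this figure is what the parsing and dictionary updates are costing.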
>     data = line.split()
>     first = int(data[0])
>
>     if len(data) == 1:
>         count = 1
>     else:
>         count = int(data[1])   # more than one repetition
Well, some experiments I might try:
    try:
        first, count = map(int, data)
    except ValueError:
        first = int(data[0])
        count = 1
or possibly
    first = int(data[0])
    try:
        count = int(data[1])
    except IndexError:
        count = 1
or even
    # pad it to contain at least two items
    # then slice off the first two
    # and then map() calls to int()
    first, count = map(int, (data + [1])[:2])
I don't know how efficient len() is (if it's internally linearly
counting the items in data, or if it's caching the length as data
is created/assigned/modified) and how that efficiency compares to
try/except blocks, map() or int() calls.
I'm not sure any of them is more or less "pythonic", but they
should all do the same thing.
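One way to settle it is to time the variants directly with the standard timeit module. This is only a sketch: the function names, the SAMPLES data and the time_variants helper are invented for illustration, and the numbers will vary by machine and Python version. For what it's worth, CPython lists store their length, so len() does not count the items one by one.

```python
import timeit


def parse_len(data):
    # The original approach: branch on len(data).
    first = int(data[0])
    count = 1 if len(data) == 1 else int(data[1])
    return first, count


def parse_try_map(data):
    # Unpacking a one-item result raises ValueError.
    try:
        first, count = map(int, data)
    except ValueError:
        first, count = int(data[0]), 1
    return first, count


def parse_try_index(data):
    # A missing second field raises IndexError.
    first = int(data[0])
    try:
        count = int(data[1])
    except IndexError:
        count = 1
    return first, count


def parse_pad(data):
    # Pad to at least two items, keep the first two, convert both.
    first, count = map(int, (data + [1])[:2])
    return first, count


VARIANTS = (parse_len, parse_try_map, parse_try_index, parse_pad)
SAMPLES = (["42"], ["42", "7"])  # made-up lines, already split


def time_variants(number=100_000):
    """Return {function name: seconds} for parsing SAMPLES `number` times."""
    results = {}
    for fn in VARIANTS:
        results[fn.__name__] = timeit.timeit(
            lambda: [fn(sample) for sample in SAMPLES], number=number
        )
    return results
```

Calling time_variants() and printing the dictionary gives a rough ranking. The mix of one-field and two-field lines in SAMPLES matters, since the try/except variants only pay for the exception when the second field is missing.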
>     if first in hist:   # add the information to the histogram
>         hist[first] += count
>     else:
>         hist[first] = count
This might also be written as
    hist[first] = hist.get(first, 0) + count
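Put together with the parsing step, the whole loop might look like the sketch below. The name build_histogram is made up, and skipping blank lines is my own assumption about the input, not something from your code.

```python
def build_histogram(lines):
    """Build {value: total count} from lines of the form 'value [count]'.

    Hypothetical helper; `lines` can be an open file or any iterable
    of strings.
    """
    hist = {}
    for line in lines:
        data = line.split()
        if not data:
            continue  # assumption: blank lines are ignored
        first = int(data[0])
        count = int(data[1]) if len(data) > 1 else 1
        # dict.get() supplies 0 for values not seen before,
        # replacing the if/else membership test.
        hist[first] = hist.get(first, 0) + count
    return hist
```

For example, build_histogram(["3", "5 2", "3 4"]) returns {3: 5, 5: 2}.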
> Is a dictionary the right way to do this? In any given file, there is
> an upper bound on the data, so it seems to me that some kind of array
> (numpy?) would be more efficient, but the upper bound changes in each
> file.
I'm not sure an array would net you great savings here, since the
upper bound seems to be unknown. If "first" has a known
maximum (surely the program generating this file has an idea of
the range of allowed values), you could just create an array the
length of the span of numbers, initialized to zero, which would
reduce the hist.get() call to just
    hist[first] += count
and then you'd iterate over hist (which would already be sorted
because it's in index order) and use those where count != 0 to
avoid the holes.
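A rough sketch of that array-based approach, using a plain Python list rather than numpy. The names build_dense_histogram, nonzero_entries and max_value are hypothetical, and it assumes every value lies between 0 and max_value inclusive.

```python
def build_dense_histogram(pairs, max_value):
    """Accumulate counts into a list indexed by value.

    Assumes 0 <= first <= max_value for every (first, count) pair;
    anything larger raises IndexError.
    """
    hist = [0] * (max_value + 1)  # one zeroed slot per possible value
    for first, count in pairs:
        hist[first] += count
    return hist


def nonzero_entries(hist):
    """Return (value, count) pairs in index order, skipping the holes."""
    return [(value, count) for value, count in enumerate(hist) if count != 0]
```

With pairs [(3, 1), (5, 2), (3, 4)] and max_value=6, nonzero_entries() gives [(3, 5), (5, 2)], already sorted by value. Whether this beats the dictionary depends on how sparse the data is relative to max_value.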
Otherwise, your code looks good... the above are just riffs on
various ways of rewriting it, in case one nets you extra
time savings per loop.
-tkc