Efficient processing of large numeric data files
Jorgen Grahn
grahn+nntp at snipabacken.dyndns.org
Sun Jan 20 08:53:54 EST 2008
On Fri, 18 Jan 2008 09:15:58 -0800 (PST), David Sanders <dpsanders at gmail.com> wrote:
> Hi,
>
> I am processing large files of numerical data. Each line is either a
> single (positive) integer, or a pair of positive integers, where the
> second represents the number of times that the first number is
> repeated in the data -- this is to avoid generating huge raw files,
> since one particular number is often repeated in the data generation
> step.
>
> My question is how to process such files efficiently to obtain a
> frequency histogram of the data (how many times each number occurs in
> the data, taking into account the repetitions). My current code is as
> follows:
...
> The data files are large (~100 million lines), and this code takes a
> long time to run (compared to just doing wc -l, for example).
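The poster's code is elided above; a minimal dict-based version of the loop it describes (not the original code, just a sketch of the format: each line is either "value" or "value count") might look like:

```python
from collections import defaultdict

def histogram(lines):
    """Return {value: total occurrences} for the run-length format
    described above: a bare integer counts once; "value count"
    counts `count` times."""
    hist = defaultdict(int)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 1:
            hist[int(fields[0])] += 1
        else:
            value, count = fields
            hist[int(value)] += int(count)
    return hist

# Usage sketch:
#   import sys
#   for value, count in sorted(histogram(sys.stdin).items()):
#       print(value, count)
```

A plain dict (or `collections.defaultdict`) keeps the whole pass at one hash lookup per line, so for 100 million lines the cost is dominated by line splitting and `int()` conversion rather than the histogram itself.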
I don't know if you are in control of the *generation* of data, but
I think it's often better and more convenient to pipe the raw data
through 'gzip -c' (i.e. gzip-compress it before it hits the disk)
than to figure out a smart application-specific compression scheme.
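As a sketch of that approach (in modern Python, using the standard-library gzip module): the generator writes through `gzip -c`, e.g. `./generate | gzip -c > data.gz`, and the analysis side reads the compressed file directly, so the raw data never sits uncompressed on disk.

```python
import gzip

def read_lines(path):
    """Yield decoded text lines from a gzip-compressed file,
    as if it were an ordinary uncompressed file."""
    with gzip.open(path, "rt") as f:
        for line in f:
            yield line
```

Decompression is cheap relative to disk I/O, and every standard text tool still works on the data via `zcat`.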
Maybe if you didn't have a homegrown file format, there would be
ready-made histogram utilities? Or, failing that, a good reason to
spend the time writing an optimized C version.
/Jorgen
--
// Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se> R'lyeh wgah'nagl fhtagn!