Efficient processing of large numeric data files

Jorgen Grahn grahn+nntp at snipabacken.dyndns.org
Sun Jan 20 08:53:54 EST 2008


On Fri, 18 Jan 2008 09:15:58 -0800 (PST), David Sanders <dpsanders at gmail.com> wrote:
> Hi,
>
> I am processing large files of numerical data.  Each line is either a
> single (positive) integer, or a pair of positive integers, where the
> second represents the number of times that the first number is
> repeated in the data -- this is to avoid generating huge raw files,
> since one particular number is often repeated in the data generation
> step.
>
> My question is how to process such files efficiently to obtain a
> frequency histogram of the data (how many times each number occurs in
> the data, taking into account the repetitions).  My current code is as
> follows:

...

> The data files are large (~100 million lines), and this code takes a
> long time to run (compared to just doing wc -l, for example).
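Since your actual loop got snipped above, here is roughly the shape I'd
expect it to take, purely as a sketch (the field handling is my guess at
your format):

    from collections import defaultdict

    def histogram(path):
        # Tally 'value' or 'value count' lines into {value: frequency}.
        counts = defaultdict(int)
        for line in open(path):
            fields = line.split()
            if not fields:
                continue            # skip blank lines
            value = int(fields[0])
            repeat = int(fields[1]) if len(fields) == 2 else 1
            counts[value] += repeat
        return counts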

I don't know if you are in control of the *generation* of data, but
I think it's often better and more convenient to pipe the raw data
through 'gzip -c' (i.e. gzip-compress it before it hits the disk)
than to figure out a smart application-specific compression scheme.
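For example, here is a rough sketch of the reading side, assuming the
generator writes one plain integer per line and you compress with
'gzip -c' (the file name is made up):

    import gzip
    from collections import defaultdict

    # gzip.open decompresses transparently, so the histogram code
    # never sees the compression at all.
    counts = defaultdict(int)
    for line in gzip.open('data.txt.gz'):
        counts[int(line)] += 1

The decompression happens on the fly, so the raw file never has to hit
the disk, and the repeat-count trick becomes unnecessary.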

Also, if the files weren't in a homegrown format, there might already be
ready-made histogram utilities you could use, or at least a good reason
to spend the time writing an optimized C version.

/Jorgen

-- 
  // Jorgen Grahn <grahn@        Ph'nglui mglw'nafh Cthulhu
\X/     snipabacken.se>          R'lyeh wgah'nagl fhtagn!
