CSV performance

Jorgen Grahn grahn+nntp at snipabacken.se
Wed Apr 29 17:28:55 EDT 2009


On Mon, 27 Apr 2009 23:56:47 +0200, dean <deank at yahoo.com> wrote:
> On Mon, 27 Apr 2009 04:22:24 -0700 (PDT), psaffrey at googlemail.com wrote:
>
>> I'm using the CSV library to process a large amount of data - 28
>> files, each of 130MB. Just reading in the data from one file and
>> filing it into very simple data structures (numpy arrays and a
>> cStringIO) takes around 10 seconds. If I just slurp one file into a
>> string, it only takes about a second, so I/O is not the bottleneck. Is
>> it really taking 9 seconds just to split the lines and set the
>> variables?
>
> I assume you're reading a 130 MB text file in 1 second only after the OS
> has already cached it, so you're not really measuring disk I/O at all.
>
> Parsing a 130 MB text file will take considerable time no matter what.
> Perhaps you should consider using a database instead of CSV.

Why would that be faster? (Assuming all data is actually read from the
database into data structures in the program, as in the text file
case.)
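
Reading the rows back out of, say, an sqlite3 file still means building
one Python object per row, so the per-row cost doesn't simply go away.
A minimal sketch of what I mean (the database file and table name are
made up for illustration, and it assumes the data was imported at some
point, which itself means parsing the text at least once):

import sqlite3
import time

# Made-up names: assumes the CSV rows were previously imported into a
# table called "samples" in data.db.
t0 = time.time()
conn = sqlite3.connect("data.db")
rows = conn.execute("SELECT * FROM samples").fetchall()  # one tuple per row
conn.close()
print("fetched %d rows in %.2fs" % (len(rows), time.time() - t0))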

I am asking because people who like databases tend to overestimate the
time it takes to parse text. (And I guess people like me who prefer
text files tend to underestimate the usefulness of databases.)
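
Either way, the cost of the split itself is easy to measure rather than
guess at. A rough sketch, where "data.csv" just stands in for one of the
poster's 130 MB files (quoted fields with embedded newlines would need
the file object handed to csv.reader directly):

import csv
import time

t0 = time.time()
raw = open("data.csv").read()               # slurp: I/O (or OS cache) only
t1 = time.time()
rows = list(csv.reader(raw.splitlines()))   # split the already-read text
t2 = time.time()

print("read  %.2fs" % (t1 - t0))
print("parse %.2fs  (%d rows)" % (t2 - t1, len(rows)))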

/Jorgen

-- 
  // Jorgen Grahn <grahn@        Ph'nglui mglw'nafh Cthulhu
\X/     snipabacken.se>          R'lyeh wgah'nagl fhtagn!


