CSV performance

Tim Chase python.list at tim.thechases.com
Mon Apr 27 10:51:56 EDT 2009


> I have tried running it just on the csv read:
...
> print "finished: %f.2" % (t1 - t0)

I presume you wanted "%.2f" here. :)
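
To show the difference: in "%f.2" the conversion is a bare %f (six decimals by default) followed by the literal text ".2", which is why your output ends in ".2". A quick check, with 3.86 assumed as the elapsed time (rounded from your output) and written for a current Python 3:

```python
elapsed = 3.86  # assumed value, rounded from the quoted timing output

# "%f" formats with six decimals, then ".2" is appended as literal text
wrong = "finished: %f.2" % elapsed
# "%.2f" asks for two digits after the decimal point
right = "finished: %.2f" % elapsed

print(wrong)
print(right)
```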

> $ ./largefilespeedtest.py
> working at file largefile.txt
> finished: 3.860000.2

So the CSV read of the file takes just shy of 4 
seconds, and you said that the pure file-read took about a 
second, so that leaves about 3 seconds for CSV processing (or 
about 3/4 of this test's runtime).  In the code example in your 
2nd post (with the timing in it), it looks like it took 15+ 
seconds, meaning the csv code is a mere 1/5 of the total runtime.  I 
also notice that you're reading the file once to find the length, 
and reading it again to process it.

> The csv files are a chromosome name,
> a coordinate and a data point, like this:
> 
> chr1	3754914	1.19828
> chr1	3754950	1.56557
> chr1	3754982	1.52371

Depending on the simplicity of the file-format (assuming nothing 
like spaces/tabs in the chromosome name, which your dictionary 
seems to indicate is the case), it may be faster to use .split() 
to do the work:

   for line in open(afile):
      a,b,c = line.rstrip('\n\r').split()

The csv module does a lot of smart stuff that it looks like you 
may not need.
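
As a rough sketch of the two approaches side by side (written for a current Python 3, with your three sample rows held in a string rather than a file; the function names are just mine for illustration), something like this shows they give the same rows for data this simple:

```python
import csv
import io

# Three sample rows in the quoted format (tab-separated)
SAMPLE = (
    "chr1\t3754914\t1.19828\n"
    "chr1\t3754950\t1.56557\n"
    "chr1\t3754982\t1.52371\n"
)

def rows_via_split(lines):
    """Parse each line with str.split(); assumes no whitespace inside fields."""
    for line in lines:
        chrom, coord, point = line.rstrip('\n\r').split()
        yield chrom, int(coord), float(point)

def rows_via_csv(lines):
    """Same rows through the csv module, for comparison."""
    for chrom, coord, point in csv.reader(lines, delimiter='\t'):
        yield chrom, int(coord), float(point)

split_rows = list(rows_via_split(io.StringIO(SAMPLE)))
csv_rows = list(rows_via_csv(io.StringIO(SAMPLE)))

print(split_rows[0])
print(split_rows == csv_rows)
```

Wrapping each loop in your existing timing code on the real file would tell you whether the difference is worth it for your data.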

However, you're still only cutting from that 3-second subset of 
your total time.  Focusing on the "filing it into very simple 
data structures" will likely net you greater improvements. I 
don't have much experience with numpy, so I can't offer much 
help there.  However, rather than reading the file twice, you might try 
a general heuristic, assuming lines are no shorter than N 
characters (they look like they're each 20 chars + a newline) and 
then using "filesize/N" as an upper bound to size the array. 
Using stat() on a file to get its size will be a heckuva lot 
faster than reading the whole file.  I also don't know the 
performance of cStringIO.StringIO() with lots of appending. 
However, since each write is just a character, you might do well 
to use the array module (unless numpy also has char-arrays) to 
preallocate n chars just like you do with your ints and floats:

   chromeio[count] = chrommap[chrom]
   coords[count] = coord
   points[count] = point
   count += 1
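
Putting the stat() estimate and the preallocation together might look roughly like the following. This is only a sketch for a current Python 3 using the stdlib array module instead of numpy. MIN_LINE_LEN, zeros(), load() and the {'chr1': 1} mapping are my own stand-ins, not names from your code. The minimum line length is a guess, so the sketch grows the arrays if the guess turns out too large, and trims the unused tail at the end:

```python
import os
import tempfile
from array import array

MIN_LINE_LEN = 18  # assumed lower bound on bytes per line, newline included

def zeros(typecode, n):
    """Return an array of n zeros of the given typecode."""
    return array(typecode, [0]) * n

def load(path, chrommap):
    # One stat() call instead of a full pass over the file to count lines
    est = os.stat(path).st_size // MIN_LINE_LEN + 1
    chroms = zeros('b', est)   # one small integer code per chromosome
    coords = zeros('l', est)
    points = zeros('d', est)
    count = 0
    with open(path) as f:
        for line in f:
            chrom, coord, point = line.rstrip('\n\r').split()
            if count == len(coords):
                # Estimate was too small: double the capacity
                chroms.extend(zeros('b', count))
                coords.extend(zeros('l', count))
                points.extend(zeros('d', count))
            chroms[count] = chrommap[chrom]
            coords[count] = int(coord)
            points[count] = float(point)
            count += 1
    # Drop the preallocated slots that were never filled
    del chroms[count:]
    del coords[count:]
    del points[count:]
    return chroms, coords, points

# Usage with the three sample rows written to a temporary file
sample = (
    "chr1\t3754914\t1.19828\n"
    "chr1\t3754950\t1.56557\n"
    "chr1\t3754982\t1.52371\n"
)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
try:
    chroms, coords, points = load(tmp.name, {'chr1': 1})
finally:
    os.remove(tmp.name)

print(list(chroms), list(coords), list(points))
```

The 'b' typecode only holds values from -128 to 127, which should be plenty for chromosome codes but is worth checking against your mapping.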

Just a few ideas to try.

-tkc







