CSV performance

Peter Otten __peter__ at web.de
Mon Apr 27 10:59:20 EDT 2009


psaffrey at googlemail.com wrote:

> Thanks for your replies. Many apologies for not including the right
> information first time around. More information is below.
> 
> I have tried running it just on the csv read:

> $ ./largefilespeedtest.py
> working at file largefile.txt
> finished: 3.860000.2
> 
> 
> A tiny bit of background on the final application: this is biological
> data from an affymetrix platform. The csv files are a chromosome name,
> a coordinate and a data point, like this:
> 
> chr1  3754914 1.19828
> chr1  3754950 1.56557
> chr1  3754982 1.52371
> 
> In the "simple data structures" code below, I do some jiggery-pokery
> with the chromosome names to save me storing the same string millions
> of times.

> $ ./affyspeedtest.py
> reading affy file largefile.txt
> finished: 15.540000.2
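
If I understand the name trick correctly, it amounts to something like
the sketch below; chrommap and chromio are my guesses at your names,
assuming each chromosome name is replaced by a one-character code:

from cStringIO import StringIO

# one short code per distinct chromosome name, e.g. "chr1" -> "A"
chrommap = {"chr1": "A", "chr2": "B", "chr3": "C"}
chromio = StringIO()  # in-memory buffer collecting one code per row

Each row then costs a single byte for the chromosome instead of yet
another multi-byte string.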

Judging by your timings (3.86 s for the bare read against 15.54 s in
total), most of the time is not spent in the csv.reader().
Here's an alternative way to read your data:

import numpy

rows = fh.read().split()  # one flat token list: name, coord, value, ...
coords = numpy.array(map(int, rows[1::3]), dtype=int)      # coordinates
points = numpy.array(map(float, rows[2::3]), dtype=float)  # data points
chromio.writelines(map(chrommap.__getitem__, rows[::3]))   # encoded names

Do things improve if you simplify your code like that?
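
For completeness, here is a self-contained version you can time directly;
the file name and the chrommap entries are made up, so adjust them to
your data. Note also the format string: the trailing ".2" in your output
suggests your script prints "%f.2" where "%.2f" was intended.

import time
import numpy
from cStringIO import StringIO

chrommap = {"chr1": "A", "chr2": "B", "chr3": "C"}  # hypothetical mapping
chromio = StringIO()

start = time.time()
fh = open("largefile.txt")
rows = fh.read().split()
fh.close()
coords = numpy.array(map(int, rows[1::3]), dtype=int)
points = numpy.array(map(float, rows[2::3]), dtype=float)
chromio.writelines(map(chrommap.__getitem__, rows[::3]))
print "finished: %.2f" % (time.time() - start)

This reads the whole file once and slices the flat token list by column,
which avoids building one list object per row the way csv.reader does.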

Peter


