speeding up reading files (possibly with cython)

Tim Chase python.list at tim.thechases.com
Sat Mar 7 19:19:55 EST 2009


> I have a program that essentially loops through a text file that's
> about 800 MB in size containing tab-separated data... my program
> parses this file and stores its fields in a dictionary of lists.
> 
> for line in file:
>   split_values = line.strip().split('\t')
>   # do stuff with split_values
> 
> Currently, this is very slow in Python, even if all I do is break up
> each line using split() and store its values in a dictionary, indexed
> by one of the tab-separated values in the file.

I'm not sure what the situation is, but I regularly skim through 
tab-delimited files of similar size and haven't noticed any 
problems like the one you describe.  You might try tweaking the 
optional (and infrequently specified) bufsize parameter of the 
open()/file() call:

   bufsize = 4 * 1024 * 1024 # buffer 4 megs at a time
   f = file('in.txt', 'r', bufsize)
   for line in f:
     split_values = line.strip().split('\t')
     # do stuff with split_values

If not specified, you're at the mercy of the system default 
(perhaps OS-specific?).  You can read more at [1], along with the 
associated warning about setvbuf().
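
One way to tell whether buffering is actually the bottleneck is to 
time the bare read-and-split loop by itself (no dictionary work) 
under the default buffering and under an explicit 4 MB buffer.  
Something along these lines (a rough, untested sketch; 'in.txt' 
stands in for your real file):

   import time

   # Time the bare read-and-split loop (no dictionary work)
   # under a given buffer size.
   def time_read(bufsize=None):
     start = time.time()
     if bufsize is None:
       f = open('in.txt', 'r')           # system-default buffering
     else:
       f = open('in.txt', 'r', bufsize)  # explicit buffer size
     for line in f:
       split_values = line.strip().split('\t')
     f.close()
     return time.time() - start

   print('default buffering: %.1fs' % time_read())
   print('4 MB buffer:       %.1fs' % time_read(4 * 1024 * 1024))

If both runs come out about the same and reasonably fast, the time 
is going into building the dictionary rather than into reading the 
file, and a bigger buffer won't help.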

-tkc


[1]
http://docs.python.org/library/functions.html#open