speeding up reading files (possibly with cython)

Tim Chase python.list at tim.thechases.com
Sun Mar 8 08:53:27 EDT 2009


Steven D'Aprano wrote:
> per wrote:
>> currently, this is very slow in python, even if all i do is break up
>> each line using split()
******************
>> and store its values in a dictionary, 
******************
>> indexing by one of the tab separated values in the file.
> 
> If that's the problem, the solution is: get more memory.

Steven caught the "and store its values in a dictionary" (which 
I missed previously and have accentuated in the above quote). 
Two factors you omitted:

   1) how many *lines* are in this file (or what's the average 
line-length).  You can use the following code both to find out 
how many lines are in the file, and to see how long it takes 
Python to skim through an 800 meg file just in terms of file-I/O:

     i = 0
     for line in open('in.txt'):  # just read and count, no parsing
       i += 1
     print "%i lines" % i

   2) how much overlap/commonality is there in the keys between 
lines?  Does every line create a new key, in which case you're 
adding $LINES keys to your dictionary?  Or do some percentage 
of lines overwrite existing entries in your dictionary with new 
values?  After one of your slow runs, issue a

     print len(my_dict)

   to see how many keys are in the final dict.
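
For concreteness, here's a minimal sketch of the loading loop 
I'm picturing (assuming the key is the *first* tab-separated 
column, which may not match your real data) that reports both 
numbers at once:

     my_dict = {}
     line_count = 0
     for line in open('in.txt'):
       fields = line.rstrip('\n').split('\t')
       my_dict[fields[0]] = fields[1:]  # assumes column 1 is the key
       line_count += 1
     print "%i lines, %i distinct keys" % (line_count, len(my_dict))

If the key-count comes back close to the line-count, nearly 
every line adds a new key and the dict itself is what's eating 
your memory.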

If you end up having millions of keys in your dict, you may be 
able to use the "anydbm" (or "shelve") module to store your dict 
on-disk and save memory.  Touching *two* files won't make the 
run faster by itself, but you at least won't be thrashing 
virtual memory with a huge dict, so the rest of your app won't 
bog down from swapping.  This has the added advantage that, if 
your input file doesn't change, you can simply reuse the dbm 
database/dict file without the need to rebuild its contents.
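
A rough sketch of that approach using the "anydbm" module from 
the Python 2 standard library (dbm keys and values must be 
strings, and I'm again assuming the first column is the key):

     import anydbm  # renamed to the "dbm" package in Python 3

     db = anydbm.open('in.db', 'c')  # 'c': create if it doesn't exist
     for line in open('in.txt'):
       fields = line.rstrip('\n').split('\t')
       db[fields[0]] = '\t'.join(fields[1:])  # values must be strings
     db.close()

On a later run, anydbm.open('in.db', 'r') hands you the same 
dict-like object back without rebuilding it.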

-tkc