building an index for large text files for fast access

Neil Hodgson nyamatongwe+thunder at gmail.com
Tue Jul 25 02:53:03 EDT 2006


Yi Xing:

> Since different lines have different size, I 
> cannot use seek(). So I am thinking of building an index for the file 
> for fast access. Can anybody give me some tips on how to do this in 
> Python?

    It depends on the size of the files and the amount of memory and 
disk you may use. First suggestion would be an in-memory array.array of 
64 bit integers made from 2 'I' entries with each 64 bit integer 
pointing to the start of a set of n lines. Then to find a particular 
line number p you seek to a[p/n] and then read over p%n lines. The 
factor 'n' is adjusted to fit your data into memory. If this uses too 
much memory or scanning the file to build the index each time uses too 
much time then you can use an index file with the same layout instead.

    Neil



More information about the Python-list mailing list