Working with Huge Text Files

Al Christians achrist at easystreet.com
Fri Mar 18 21:36:35 EST 2005


I did some similar stuff way back, about 12-15 years ago -- in 640 KB
MS-DOS with gigabyte files on 33 MHz machines.  I got good performance:
I could bring up any record out of 10 million or so on the screen in a
couple of seconds (not using Python, but that should not make much
difference; some things in Python might even make it work better).

Even though my files were text, I read them as random-access binary
files.  You need to be able to dive in at an arbitrary point in the
file, read a chunk of data, split it up into lines, discard any partial
lines at the beginning and end, pull out the keys, and see where you
are.  Even with a gigabyte of file, if you are reading a decent-size
chunk, you can binary search down to the spot you want in 15-20 tries
or so.  That's the first time; after that you've got a better idea
where to look.  Use a dictionary to save the information from each
chunk, giving you an index that gives you a head start on the next
search.  If you can keep 10k to 100k entries in your index, you can do
thousands of searches before you even have to worry about having too
many index entries.

I did learn that on 32-bit hardware, doing a binary search of a file
over a gigabyte will fail if you calculate the next place to look as
(a + b)/2, because a + b can exceed 2 GB and overflow.  You have to
compute a + (b - a)/2 instead.
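
To see the difference, here is a little Python illustration.  Python's
own integers never overflow, so it fakes 32-bit signed arithmetic to
show where the naive midpoint goes wrong once a + b passes 2 GB:

def i32(n):
    # reduce n to a signed 32-bit (two's complement) value
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

a, b = 1500000000, 1900000000       # byte offsets in a file over 1 GB

naive = i32(i32(a + b) // 2)        # a + b = 3,400,000,000 wraps negative
safe  = i32(a + i32(b - a) // 2)    # intermediate values stay under 2**31

print(naive, safe)                  # -447483648 1700000000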


Al


