"Faster" I/O in a script

Kris Kennaway kris at FreeBSD.org
Wed Jun 4 07:51:26 EDT 2008


Gary Herron wrote:
> miller.paul.w at gmail.com wrote:
>> On Jun 2, 2:08 am, "kalakouentin" <kalakouen... at yahoo.com> wrote:
>>
>>  
>>>  Do you know a way to actually load my data in a more
>>> "batch-like" way so I will avoid the constant line by line reading?
>>>     
>>
>> If your files will fit in memory, you can just do
>>
>> text = file.readlines()
>>
>> and Python will read the entire file into a list of strings named
>> 'text,' where each item in the list corresponds to one 'line' of the
>> file.
>>   
> 
> No, that won't help.  That has to do *all* the same work (reading blocks 
> and finding line endings) as the iterator PLUS allocate and build a list.
> Better to just use the iterator.
> 
> for line in file:
>  ...

Actually, iterating over the file line by line *can* be much slower. 
Suppose I want to search a file to see whether a substring is present.

st = "some substring that is not actually in the file"
f = <50 MB log file>

Method 1:

for i in file(f):
    if st in i:
        break

--> 0.472416 seconds

Method 2:

Read whole file:

fh = file(f)
rl = fh.read()
fh.close()

--> 0.098834 seconds

"st in rl" test --> 0.037251 (total: .136 seconds)

Method 3:

mmap the file:

fh = file(f)   # reopen the file; it was closed after reading
mm = mmap.mmap(fh.fileno(), 0, mmap.MAP_SHARED, mmap.PROT_READ)

"st in mm" test --> 3.589938 seconds (see my post from the other day)

mm.find(st) --> 0.186895 seconds
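
For anyone who wants to reproduce these numbers, a minimal harness along 
the following lines should work.  This is a sketch, not the exact script 
I timed with: the file path is a placeholder, and it assumes Python 2 on 
a Unix platform (on Python 3 you would open the file in binary mode and 
search for byte strings).

import mmap
import time

st = "some substring that is not actually in the file"
f = "/path/to/big.log"   # placeholder: any large (~50 MB) file

# Method 1: iterate line by line, stopping at the first hit
t0 = time.time()
for i in open(f):
    if st in i:
        break
print("iterate:   %.6f s" % (time.time() - t0))

# Method 2: read the whole file at once, then do one substring test
t0 = time.time()
fh = open(f)
rl = fh.read()
fh.close()
print("read:      %.6f s" % (time.time() - t0))

t0 = time.time()
found = st in rl
print("in test:   %.6f s" % (time.time() - t0))

# Method 3: mmap the file and search it in place
fh = open(f)
mm = mmap.mmap(fh.fileno(), 0, mmap.MAP_SHARED, mmap.PROT_READ)
t0 = time.time()
pos = mm.find(st)
print("mmap find: %.6f s" % (time.time() - t0))
mm.close()
fh.close()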

Summary:

If you can afford the memory, it can be much more efficient (more than 3 
times faster in this example) to read the whole file into memory and 
process it in one pass.

Mmapping the file and processing it at once is roughly as fast (I didn't 
measure the difference carefully), but has the advantage that any parts 
of the file you never touch are never faulted into memory.  You could 
also mmap the file one chunk at a time to cap memory use, but you would 
then have to deal with matches and records that straddle chunk 
boundaries; a sketch of that follows below.
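
Here is a minimal sketch of that chunked approach for the plain 
substring-search case.  The 16 MB window size and the helper name are my 
own choices (nothing above defines them), and it needs Python 2.6 or 
later, since the offset argument to mmap.mmap() only appeared in 2.6:

import mmap
import os

def find_in_chunks(path, needle, window=16 * 1024 * 1024):
    # Search for needle by mmapping one fixed-size window at a time.
    # window must be a multiple of mmap.ALLOCATIONGRANULARITY (16 MB is).
    # Each window is extended by len(needle) - 1 bytes so that a match
    # straddling two windows is still seen in full.
    overlap = len(needle) - 1
    size = os.path.getsize(path)
    fh = open(path, "rb")
    try:
        offset = 0
        while offset < size:
            length = min(window + overlap, size - offset)
            mm = mmap.mmap(fh.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset)
            try:
                pos = mm.find(needle)
                if pos != -1:
                    return offset + pos   # absolute offset of the match
            finally:
                mm.close()
            offset += window
    finally:
        fh.close()
    return -1

Peak mapped memory stays at roughly one window no matter how large the 
file is; the len(needle) - 1 overlap catches matches that straddle a 
window edge, though line- or record-oriented processing would need the 
same kind of care at the boundaries.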

Kris


