How to remove subset from a file efficiently?

Bengt Richter bokr at oz.net
Sat Jan 14 17:29:19 EST 2006


On 13 Jan 2006 23:17:05 -0800, bonono at gmail.com wrote:

>
>fynali wrote:
>> $ cat cleanup_ray.py
>>     #!/usr/bin/python
>>     import itertools
>>
>>     b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))
>>
>>     file('PSP-CBR.dat,ray','w').writelines(
>>         itertools.ifilterfalse(b.__contains__,
>>             file('/home/sajid/python/wip/stc/2/PSP0000333')))
>>
>>     --
>>     $ time ./cleanup_ray.py
>>
>>     real    0m5.451s
>>     user    0m4.496s
>>     sys     0m0.428s
>>
>> (-: Damn!  That saves a bit more time!  Bravo!
>>
>> Thanks to you Raymond.
>Have you tried the explicit loop variant with psyco? My experience is
>that psyco is pretty good at optimizing for loops, which usually results
>in faster code than even the built-in map/filter variants.
>
>Though it would only be a 1 or 2 sec difference (given what you already
>have), so it may not be important, but it could be fun.
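For concreteness, the explicit-loop variant mentioned above might look something like the sketch below (Python 2, same file paths as in the quoted script; psyco is optional and the loop runs unchanged without it; not benchmarked):

    #!/usr/bin/python
    # Explicit-loop variant (a sketch, not a drop-in for the quoted script).
    # psyco mainly accelerates code inside functions, hence the wrapper.
    try:
        import psyco
        psyco.full()
    except ImportError:
        pass  # runs fine, just slower, without psyco

    def cleanup(subset_path, data_path, out_path):
        b = set(file(subset_path))        # lines to drop; O(1) membership tests
        out = file(out_path, 'w')
        for line in file(data_path):
            if line not in b:
                out.write(line)
        out.close()

    cleanup('/home/sajid/python/wip/stc/2/CBR0000333',
            '/home/sajid/python/wip/stc/2/PSP0000333',
            'PSP-CBR.dat,ray')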
>
OTOH, when you are dealing with large files and near-optimal simple processing, you are
likely to be comparing i/o-bound processes, meaning that the differences observed
will be symptoms of OS and file-system performance more than of the algorithms.

An exception is when a slight variation in the algorithm causes a large change
in i/o performance, e.g. if it produces physical seek and read patterns of disk
access that the OS/file system and disk interface hardware can't entirely optimize away
with smart buffering etc. Not to mention possible interactions with all the other things
an OS may be doing "simultaneously", switching between tasks that it accounts for as real/user/sys.
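One rough way to see how i/o-bound a particular run is: compare CPU time (user+sys) with
wall-clock time for the same call. A sketch (Python 2, using os.times(); the function being
timed is whatever cleanup variant you want to measure):

    import os

    def timed(func, *args):
        # os.times() returns (user, sys, child-user, child-sys, elapsed);
        # real time much larger than user+sys suggests the run is waiting
        # on the disk/OS rather than burning CPU in the algorithm itself.
        u0, s0, _, _, w0 = os.times()
        result = func(*args)
        u1, s1, _, _, w1 = os.times()
        print "user %.3fs  sys %.3fs  real %.3fs" % (u1 - u0, s1 - s0, w1 - w0)
        return result

e.g. timed(cleanup, cbr_path, psp_path, out_path), with cleanup and the paths standing in
for whichever variant and files are being compared.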

Regards,
Bengt Richter
