How to remove subset from a file efficiently?

Steve Holden steve at holdenweb.com
Thu Jan 12 13:48:00 EST 2006


Fredrik Lundh wrote:
> "fynali" wrote:
> 
> 
>>>Objective: to remove the numbers present in barred-list from the
>>>PSPfile.
>>>
>>>    $ ls -lh PSP0000320.dat CBR0000319.dat
>>>    ...  56M Dec 28 19:41 PSP0000320.dat
>>>    ... 8.6M Dec 28 19:40 CBR0000319.dat
>>>
>>>   $ wc -l PSP0000320.dat CBR0000319.dat
>>>     4,462,603 PSP0000320.dat
>>>       693,585 CBR0000319.dat
>>>
>>>I wrote the following in python to do it:
>>>
>>>    #: c01:rmcommon.py
>>>    barredlist = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
>>>    postlist = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')
>>>    outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')
>>>
>>>    # reading it all in one go, so as to avoid frequent disk accesses
>>>    (assume machine has plenty memory)
>>>    barredlist.read()
>>>    postlist.read()
>>>
>>>    #
>>>    for number in postlist:
>>>            if number in barrlist:
>>>                    pass
>>>            else:
>>>                    outfile.write(number)
>>>
>>>    barredlist.close(); postlist.close(); outfile.close()
>>>    #:~
>>>
>>>The above code simply takes too long to complete.
> 
> 
> the above code doesn't even run.
> 
> (why is it that nobody remembers how to use cut and paste these
> days?  has it perhaps been banned in some part of the world,
> without me noticing)
> 
> this might work a little better:
> 
>         barred = set(open('/home/sjd/python/wip/CBR0000319.dat'))
> 
>         infile = open('/home/sjd/python/wip/PSP0000320.dat')
>         outfile = open('/home/sjd/python/wip/PSP-CBR.dat', 'w')
> 
>         for number in infile:
>             if number not in barred:
>                 outfile.write(number)
> 
> if you feel adventurous, you can replace the for/if loop with
> 
>         outfile.writelines(number for number in infile if number not in barred)
> 
> :::
> 
> tim wrote:
> 
> 
>>It should be quicker to do this
>>
>>   #
>>   for number in postlist:
>>           if not number in barrlist:
>>                   outfile.write(number)
>>
>>
>>and quicker doing this
>>
>>   #
>>numbers =  [number for number in postlist if not number in barrlist]
>>outfile.write(''.join(numbers))
> 
> 
> looks like premature non-optimization to me...
> 
It might be quicker to establish a dict whose keys are the barred 
numbers and use that, rather than a list, to determine whether the input 
numbers should make it through.
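
A minimal sketch of that idea, assuming each file holds one number per
line and every line (including the last) ends with a newline, so whole
lines can be compared as they are. The function name remove_barred and
the in-memory sample data are made up for illustration; with the real
files you would pass open file objects instead.

```python
import io

def remove_barred(infile, barred_file, outfile):
    """Copy lines from infile to outfile, skipping lines present in barred_file.

    Hypothetical helper, not code from the original thread.
    """
    # Build a dict whose keys are the barred lines; the values are unused
    # (None). Key lookup is a hash probe, not a scan through a list.
    barred = dict.fromkeys(barred_file)
    kept = 0
    for number in infile:
        if number not in barred:
            outfile.write(number)
            kept += 1
    return kept

# Small in-memory stand-ins for the PSP and CBR files.
posts = io.StringIO("111\n222\n333\n444\n")
barred_numbers = io.StringIO("222\n444\n")
result = io.StringIO()

kept = remove_barred(posts, barred_numbers, result)
print(kept, repr(result.getvalue()))
```

A set of the barred lines would serve the same purpose; either way the
membership test no longer depends on how long the barred list is.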

regards
  Steve
-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC                     www.holdenweb.com
PyCon TX 2006                  www.python.org/pycon/



