How to remove subset from a file efficiently?

Tim Williams (gmail) tdwdotnet at gmail.com
Thu Jan 12 12:35:57 EST 2006


On 12/01/06, Tim Williams (gmail) <tdwdotnet at gmail.com> wrote:
>
>
>
> On 12 Jan 2006 09:04:21 -0800, fynali <iladijas at gmail.com> wrote:
> >
> > Hi all,
> >
> > I have two files:
> >
> >   - PSP0000320.dat (quite a large list of mobile numbers),
> >   - CBR0000319.dat (a subset of the above, a list of barred numbers)
> >
> >     # head PSP0000320.dat CBR0000319.dat
> >     ==> PSP0000320.dat <==
> >     96653696338
> >     96653766996
> >     96654609431
> >     96654722608
> >     96654738074
> >     96655697044
> >     96655824738
> >     96656190117
> >     96656256762
> >     96656263751
> >
> >     ==> CBR0000319.dat <==
> >     96651131135
> >     96651131135
> >     96651420412
> >     96651730095
> >     96652399117
> >     96652399142
> >     96652399142
> >     96652399142
> >     96652399160
> >     96652399271
> >
> > Objective: to remove the numbers present in barred-list from the
> > PSPfile.
> >
> >     $ ls -lh PSP0000320.dat CBR0000319.dat
> >     ...  56M Dec 28 19:41 PSP0000320.dat
> >     ... 8.6M Dec 28 19:40 CBR0000319.dat
> >
> >     $ wc -l PSP0000320.dat CBR0000319.dat
> >      4,462,603 PSP0000320.dat
> >        693,585 CBR0000319.dat
> >
> > I wrote the following in python to do it:
> >
> >     #: c01:rmcommon.py
> >     barredfile = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
> >     postfile = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')
> >     outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')
> >
> >     # reading it all in one go, so as to avoid frequent disk accesses
> >     # (assume machine has plenty memory)
> >     barredlist = barredfile.readlines()
> >     postlist = postfile.readlines()
> >
> >     # write out every number that is not in the barred list
> >     for number in postlist:
> >             if number in barredlist:
> >                     pass
> >             else:
> >                     outfile.write(number)
> >
> >     barredfile.close(); postfile.close(); outfile.close()
> >     #:~
> >
> > The above code simply takes too long to complete.  If I were to do a
> > diff -y PSP0000320.dat CBR0000319.dat, catch the '<' & clean it up with
> > sed -e 's/\([0-9]*\) *</\1/' > PSP-CBR.dat it takes <4 minutes to
> > complete.
>
>
>
> It should be quicker to do this
>
>    #
>    for number in postlist:
>            if number not in barredlist:
>                    outfile.write(number)
>
>
> and quicker doing this
>
>    #
>    numbers = [number for number in postlist if number not in barredlist]
>    outfile.writelines(numbers)
>

I forgot to add this one

for num in (number for number in postlist if number not in barredlist):
         outfile.write(num)
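
All of the variants above still test each number against a list, so every
lookup scans the barred numbers one by one. A set-based version is probably
worth trying too. The following is only a rough sketch, assuming a current
Python and that the barred numbers fit comfortably in memory; the names
remove_barred and filter_file are made up for illustration and are not from
the original script.

```python
def remove_barred(post_lines, barred_lines):
    """Yield each line of post_lines that does not occur in barred_lines."""
    # Membership tests on a set take roughly constant time, whereas
    # "number in some_list" walks the list on every lookup.
    barred = set(barred_lines)
    for number in post_lines:
        if number not in barred:
            yield number


def filter_file(post_path, barred_path, out_path):
    """Copy post_path to out_path, skipping lines found in barred_path.

    Hypothetical helper: the three paths are supplied by the caller.
    """
    with open(barred_path) as barred_file:
        barred = set(barred_file)
    with open(post_path) as post_file, open(out_path, 'w') as out_file:
        out_file.writelines(remove_barred(post_file, barred))
```

Lines are compared exactly as read, newline included, so a final line
without a trailing newline, or stray whitespace, would not match. The output
keeps the order of the PSP file, and duplicates in the barred list do no
harm.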