How to remove subset from a file efficiently?

Thu Jan 12 12:04:21 EST 2006

Hi all,

I have two files:

  - PSP0000320.dat (quite a large list of mobile numbers),
  - CBR0000319.dat (a subset of the above, a list of barred numbers)

    # head PSP0000320.dat CBR0000319.dat
    ==> PSP0000320.dat <==
    96653696338
    96653766996
    96654609431
    96654722608
    96654738074
    96655697044
    96655824738
    96656190117
    96656256762
    96656263751

    ==> CBR0000319.dat <==
    96651131135
    96651131135
    96651420412
    96651730095
    96652399117
    96652399142
    96652399142
    96652399142
    96652399160
    96652399271

Objective: to remove the numbers present in the barred list from the
PSP file.

    $ ls -lh PSP0000320.dat CBR0000319.dat
    ...  56M Dec 28 19:41 PSP0000320.dat
    ... 8.6M Dec 28 19:40 CBR0000319.dat

    $ wc -l PSP0000320.dat CBR0000319.dat
     4,462,603 PSP0000320.dat
       693,585 CBR0000319.dat

I wrote the following in Python to do it:

    #: c01:rmcommon.py
    barredlist = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
    postlist = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')
    outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')

    # reading it all in one go, so as to avoid frequent disk accesses
    # (assume machine has plenty of memory)
    barred = barredlist.readlines()
    numbers = postlist.readlines()

    # write out every number that is not in the barred list
    for number in numbers:
            if number in barred:
                    pass
            else:
                    outfile.write(number)

    barredlist.close(); postlist.close(); outfile.close()
    #:~

The above code simply takes too long to complete.  If I instead run
diff -y PSP0000320.dat CBR0000319.dat, keep the lines marked '<', and
clean them up with sed -e 's/\([0-9]*\) *</\1/' > PSP-CBR.dat, it takes
under 4 minutes to complete.
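
For comparison, the membership test in the loop could probably be made
much cheaper by holding the barred numbers in a set, where each lookup
takes roughly constant time instead of scanning a whole list. A minimal
sketch, assuming one number per line and enough memory for the barred
list; the function name remove_barred is hypothetical and not part of
my script:

```python
def remove_barred(psp_path, cbr_path, out_path):
    """Write every number in psp_path that is not listed in cbr_path.

    Returns the count of numbers written. (Hypothetical helper.)
    """
    # Load the barred numbers into a set; duplicates collapse automatically.
    with open(cbr_path) as cbr:
        barred = {line.strip() for line in cbr}

    kept = 0
    # Stream the large file line by line rather than reading it all at once.
    with open(psp_path) as psp, open(out_path, "w") as out:
        for line in psp:
            number = line.strip()
            # Skip blank lines and anything found in the barred set.
            if number and number not in barred:
                out.write(number + "\n")
                kept += 1
    return kept
```

It would be called as something like
remove_barred('PSP0000320.dat', 'CBR0000319.dat', 'PSP-CBR.dat').
Neither file needs to be sorted for this to work.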

I wrote the following in bash to do the same:

    #!/bin/bash

    ARGS=2

    if [ $# -ne $ARGS ]     # takes two arguments
    then
        echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
        echo; echo "    eg.: `basename $0` PSP0000320.dat CBR0000319.dat"; echo;
        echo "NOTE: first argument: PSP file, second: CBR file";
        echo "      this script does _no_ input validation!"
        exit 1
    fi;

    # fix prefix; cost: 12.587 secs
    cat $1 | sed -e 's/^0*/966/' > $1.good
    cat $2 | sed -e 's/^0*/966/' > $2.good

    # sort/save files; for the 4,462,603 lines, cost: 36.589 secs
    sort $1.good > $1.sorted
    sort $2.good > $2.sorted

    # diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
    diff -y $1.sorted $2.sorted | grep "<" > $1.filtered

    # remove trailing junk [spaces & <]; cost: 1 min 3 secs
    cat $1.filtered | sed -e 's/\([0-9]*\) *</\1/' > $1.cleaned

    # remove intermediate files, good, sorted, filtered
    rm -f *.good *.sorted *.filtered
    #:~

...but strangely though, there's a discrepancy, the reason for which I
can't figure out!
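
To narrow down where two results differ, something like the following
could list the numbers that appear in only one of two output files. It
is only a sketch: compare_outputs and its two path arguments are
hypothetical names, and because it compares sets it ignores duplicate
lines and line order.

```python
def compare_outputs(path_a, path_b):
    """Return (only_in_a, only_in_b) as sorted lists of numbers.

    Hypothetical diagnostic helper; duplicates and ordering are ignored.
    """
    def load(path):
        with open(path) as handle:
            # Strip whitespace so trailing spaces do not cause false mismatches.
            return {line.strip() for line in handle if line.strip()}

    a, b = load(path_a), load(path_b)
    return sorted(a - b), sorted(b - a)
```

Two empty lists would mean both files hold the same set of numbers, so
any remaining difference in line counts would come from duplicates.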

Needless to say, I'm utterly new to Python and my programming skills &
know-how are rudimentary.

Any help will be genuinely appreciated.

--
fynali