Efficient grep using Python?

John Hunter jdhunter at ace.bsd.uchicago.edu
Wed Dec 15 11:02:05 EST 2004


>>>>> "sf" == sf  <sf at sf.sf> writes:

    sf> Just started thinking about learning Python.  Is there any
    sf> place where I can get some free examples, especially for the
    sf> following kind of problem (it must be trivial for those using
    sf> Python)?

    sf> I have files A and B, each containing say 100,000 lines (each
    sf> line = one string without any spaces).

    sf> I want to do

    sf> " A - (A intersection B) "

    sf> Essentially, I want to do an efficient grep, i.e., from A remove
    sf> those lines which are also present in file B.

If you're only talking about 100K lines or so, and you have a
reasonably modern computer, you can do this all in memory.  If order
doesn't matter (it probably does), you can use a set to get all the
lines in file A that are not in B:

    from sets import Set                     # sets.Set predates the builtin set of Python 2.4
    A = Set(file('test1.dat').readlines())   # each element is one line, newline included
    B = Set(file('test2.dat').readlines())
    print A - B                              # lines of A not in B; order is lost

To preserve order, you should use a dictionary that maps each line to
its line number and later use those numbers to sort (note that
duplicate lines within a file collapse to a single entry):

    A = dict([(line, num) for num, line in enumerate(file('test1.dat'))])
    B = dict([(line, num) for num, line in enumerate(file('test2.dat'))])

    keep = [(num, line) for line, num in A.items() if line not in B]
    keep.sort()
    for num, line in keep:
        print line,    # trailing comma: each line already ends in a newline
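
If you want to skip the dictionaries and the sort altogether, a minimal
sketch of the same idea (using the same test1.dat/test2.dat files as
above) is to load only B into a set and stream over A, printing the
survivors in their original order:

    from sets import Set
    B = Set(file('test2.dat'))        # lines of B, used only for membership tests
    for line in file('test1.dat'):    # walk A in order, one line at a time
        if line not in B:
            print line,               # the line already carries its newline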

Now someone else will come along and tell you all this functionality
is already in the standard library.  But it's always fun to hack this
out yourself once because Python makes such things so damned easy.
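
And indeed, as of the just-released Python 2.4 the unordered version
needs nothing beyond the builtins, since the builtin set accepts any
iterable (same files as above, order still not preserved):

    A = set(file('test1.dat'))   # builtin set, new in Python 2.4
    B = set(file('test2.dat'))
    print A - B                  # still unordered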

JDH



