Efficient grep using Python?

Thu Dec 16 11:31:53 EST 2004

On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that P at draigBrady.com
might have written:

[sf]
>>>>Essentially, want to do efficient grep, i..e from A remove those lines which
>>>>are also present in file B.

[p at draig]
>>>You could implement elegantly using the new sets feature
>>>For reference here is the unix way to do it:
>>>
>>>sort a b b | uniq -u

[Christos]
>> No, like I just wrote in another post, he wants
>> 
>> $ grep -vf B A
>> 
>> I think that
>> 
>> $ sort A B B | uniq -u
>> 
>> can be abbreviated to
>> 
>> $ sort -u A B B
>> 
>> which is the union rather than the intersection of the files

[P at draig]
>wrong. Notice the -u option to uniq.
>http://marc.free.net.ph/message/20021101.043138.1bc24964.html

I see your point.  That's a new to me use of uniq, since I started using
Unices long before GNU versions of the tools, but then, I might have
missed the -u option.

$ cat A
aa
ss
dd
ff
gg
hh
jj
kk
$ cat B
aa
dd
gg
jj
pp
$ time sort A B B | uniq -u
ff
hh
kk
ss

real    0m0.004s
user    0m0.004s
sys     0m0.004s
$ time grep -vf B A
ss
ff
hh
kk

real    0m0.003s
user    0m0.000s
sys     0m0.003s

So I stand corrected that your solution does *not* give the union.

>> wastes some time by considering B twice
>
>I challenge you to a benchmark :-)

Well, the numbers I provided above are almost meaningless with such a
small set (and they easily could be reverse, I just kept the
convenient-to-me first run :).  Do you really believe that sorting three
files and then scanning their merged output counting duplicates is
faster than scanning two files (and doing lookups during the second
scan)?

$ python
Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x=open('/usr/share/dict/words').readlines()
>>> len(x)
45378
>>> import random
>>> random.shuffle(x)
>>> open("/tmp/A", "w").writelines(x)
>>> random.shuffle(x)
>>> open("/tmp/B", "w").writelines(x[:1000])
>>>
$ time sort A B B | uniq -u >/dev/null

real    0m0.311s
user    0m0.315s
sys     0m0.008s
$ time grep -Fvf B A >/dev/null

real    0m0.067s
user    0m0.064s
sys     0m0.003s

(Yes, I cheated by adding the F (for no regular expressions) flag :)

>> and finally destroys original line
>> order (should it be important).
>
>true

That's our final agreement :)
-- 
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...