Efficient grep using Python? [OT]

Fri Dec 17 08:21:45 EST 2004

On Fri, 17 Dec 2004 12:21:08 +0000, rumours say that P at draigBrady.com
might have written:

[snip some damn lie aka "benchmark"]

[me]
>> (Yes, I cheated by adding the F (for no regular expressions) flag :)
>
>Also you only have 1000 entries in B!
>Try it again with all entries in B also ;-)
>Remember the original poster had 100K entries!

Well, that's the closest I can get:

$ py
Python 2.4c1 (#3, Nov 26 2004, 23:39:44)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys; sys.ps1='.>>'
.>> alist=[line.strip() for line in open('/usr/share/dict/words')]
.>> words=set()
.>> for word in alist:
...     words.add(word + '\n')
...     words.add(word[::-1] + '\n')
...
.>> len(words)
90525
.>> words=list(words)
.>> open('/tmp/A', 'w').writelines(words)
.>> import random; random.shuffle(words)
.>> open('/tmp/B', 'w').writelines(words[:90000])
.>>
$ time sort A B B | uniq -u >/dev/null

real    0m2.408s
user    0m2.437s
sys     0m0.037s
$ time grep -Fvf B A >/dev/null

real    0m1.208s
user    0m1.161s
sys     0m0.035s

What now?-)

Mind you, I only replied in the first place because you wrote (my
emphasis) "...here is *the* unix way..." and it's the bad days of the
month (not mine, actually, but I suffer along...)

>>>>and finally destroys original line
>>>>order (should it be important).
>>>
>>>true
>> 
>> That's our final agreement :)
>
>Note the order is trivial to restore with a
>"decorate-sort-undecorate" idiom.

Using python or unix tools (e.g. 'paste -d', 'sort -k', 'cut -d')?
Because the python way has already been discussed by Fredrik, John and
Tim, and the unix way gets overly complicated (aka non-trivial) if DSU
is involved.
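
For reference, a minimal sketch of what decorate-sort-undecorate could look
like on the python side. The function name restore_order and the sample
lists are made up for illustration; this is not code from earlier in the
thread, and it assumes every surviving line really does occur in the
original file.

```python
def restore_order(original_lines, surviving_lines):
    """Put surviving_lines back into the order they have in original_lines."""
    # Remember the first position at which each line occurs in the original.
    first_pos = {}
    for index, line in enumerate(original_lines):
        first_pos.setdefault(line, index)
    # Decorate: pair each surviving line with its original position.
    decorated = [(first_pos[line], line) for line in surviving_lines]
    # Sort: order by original position.
    decorated.sort()
    # Undecorate: drop the position again.
    return [line for _, line in decorated]


original = ["zz", "bb", "mm", "aa"]   # hypothetical file contents
survivors = ["aa", "mm", "zz"]        # e.g. what comes out of a sort pipeline
restored = restore_order(original, survivors)
print(restored)
```

Doing the same with 'paste', 'sort -k' and 'cut' means numbering the lines
first and carrying that extra column through the whole pipeline.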

BTW, the following occurred to me:

tzot at tril/tmp
$ cat >A
aa
ss
dd
ff
gg
hh
jj
kk
ll
aa
tzot at tril/tmp
$ cat >B
ss
ff
hh
kk
tzot at tril/tmp
$ sort A B B | uniq -u
dd
gg
jj
ll
tzot at tril/tmp
$ grep -Fvf B A
aa
dd
gg
jj
ll
aa

Note that 'aa' appears twice in the A file (the one to be filtered by B)
and not at all in B: grep keeps both copies, while 'sort | uniq -u' drops
them.  So our methods do not produce the same output.  As the OP wrote:

>Essentially, want to do efficient grep, i.e. from A remove those lines which
>are also present in file B.

grep is the unix way to go for both speed and correctness.
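
To make the difference concrete, here is a rough python sketch of what the
two pipelines compute on the small A and B files above. The function names
are invented for illustration, and it assumes a python recent enough to
have collections.Counter. One caveat: the set-based version compares whole
lines, which is closer to 'grep -Fxvf B A'; plain 'grep -Fvf' also drops
lines of A that merely contain a line of B as a substring.

```python
from collections import Counter


def filter_like_grep(a_lines, b_lines):
    """Keep every line of A that is not a line of B (order and duplicates kept)."""
    unwanted = set(b_lines)
    return [line for line in a_lines if line not in unwanted]


def filter_like_sort_uniq(a_lines, b_lines):
    """Emulate `sort A B B | uniq -u`: sorted lines occurring exactly once overall."""
    counts = Counter(a_lines)
    counts.update(b_lines)  # B is fed to sort twice,
    counts.update(b_lines)  # so each of its lines counts at least twice
    return sorted(line for line, n in counts.items() if n == 1)


A = "aa ss dd ff gg hh jj kk ll aa".split()
B = "ss ff hh kk".split()

grep_result = filter_like_grep(A, B)
uniq_result = filter_like_sort_uniq(A, B)
print(grep_result)
print(uniq_result)
```

The first result keeps both copies of 'aa'; the second loses them, because
'uniq -u' throws away any line that occurs more than once, whether or not
it appears in B.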

I would call this issue a dead horse.
-- 
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...