Efficient grep using Python?

Fri Dec 17 07:21:08 EST 2004

Christos TZOTZIOY Georgiou wrote:
> On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that P at draigBrady.com
>>I challenge you to a benchmark :-)
> 
> 
> Well, the numbers I provided above are almost meaningless with such a
> small set (and they easily could be reversed; I just kept the
> convenient-to-me first run :).  Do you really believe that sorting three
> files and then scanning their merged output counting duplicates is
> faster than scanning two files (and doing lookups during the second
> scan)?
> 
> $ python
> Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
> [GCC 3.3.3 (SuSE Linux)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> 
> >>> x=open('/usr/share/dict/words').readlines()
> >>> len(x)
> 45378
> >>> import random
> >>> random.shuffle(x)
> >>> open("/tmp/A", "w").writelines(x)
> >>> random.shuffle(x)
> >>> open("/tmp/B", "w").writelines(x[:1000])
> 
> $ time sort A B B | uniq -u >/dev/null
> 
> real    0m0.311s
> user    0m0.315s
> sys     0m0.008s
> $ time grep -Fvf B A >/dev/null
> 
> real    0m0.067s
> user    0m0.064s
> sys     0m0.003s
> 
> (Yes, I cheated by adding the F (for no regular expressions) flag :)

Also, you only have 1000 entries in B!
Try it again with all of the entries in B as well ;-)
Remember, the original poster had 100K entries!
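
For anyone who wants to repeat the comparison from within Python rather
than with the shell tools, here is a minimal sketch of the two approaches
under discussion. The function names are made up for illustration, and it
assumes a Python with the built-in set type. Note that the lookup version
matches whole lines (closer to "grep -Fvxf B A" than to "grep -Fvf B A"),
and that the sorted version, like "uniq -u", also drops any line that
occurs more than once in A itself.

```python
from itertools import groupby


def lines_not_in_b_lookup(a_lines, b_lines):
    """Scan B into a set, then scan A once, keeping lines absent from B.

    Hypothetical helper: whole-line matching, original order of A kept.
    """
    seen = set(b_lines)
    return [line for line in a_lines if line not in seen]


def lines_not_in_b_sorted(a_lines, b_lines):
    """Mimic `sort A B B | uniq -u`.

    Hypothetical helper: sort A + B + B and keep only the lines that
    occur exactly once. The result comes back in sorted order.
    """
    merged = sorted(list(a_lines) + list(b_lines) + list(b_lines))
    return [key for key, group in groupby(merged) if len(list(group)) == 1]
```

Timing both on the same A and B (for instance with the timeit module)
should show how the gap changes as B grows.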

>>>and finally destroys original line
>>>order (should it be important).
>>
>>true
> 
> That's our final agreement :)

Note the order is trivial to restore with a
"decorate-sort-undecorate" idiom.

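As a rough illustration of that idiom in Python (a sketch only; the
function name is hypothetical): decorate each line with its original
position, sort, keep the lines that occur exactly once as "uniq -u"
would, then sort the survivors by position and strip the decoration.

```python
from itertools import groupby


def unique_lines_in_original_order(lines):
    """Keep lines that occur exactly once, in their original order."""
    # Decorate: pair every line with its position in the input.
    decorated = [(line, index) for index, line in enumerate(lines)]
    # Sort by line text so that equal lines become adjacent.
    decorated.sort()
    # Keep only the groups of size one (the `uniq -u` step).
    survivors = []
    for _, group in groupby(decorated, key=lambda pair: pair[0]):
        group = list(group)
        if len(group) == 1:
            survivors.append(group[0])
    # Restore the original order, then undecorate.
    survivors.sort(key=lambda pair: pair[1])
    return [line for line, _ in survivors]
```

Called on the lines of A followed by B twice, this returns the lines
of A that do not appear in B, in the order they had in A (assuming A
itself has no repeated lines).
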
-- 
Pádraig Brady - http://www.pixelbeat.org
--