Efficient grep using Python?
P at draigBrady.com
P at draigBrady.com
Fri Dec 17 07:21:08 EST 2004
Christos TZOTZIOY Georgiou wrote:
> On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that P at draigBrady.com
>>I challenge you to a benchmark :-)
>
>
> Well, the numbers I provided above are almost meaningless with such a
> small set (and they easily could be reverse, I just kept the
> convenient-to-me first run :). Do you really believe that sorting three
> files and then scanning their merged output counting duplicates is
> faster than scanning two files (and doing lookups during the second
> scan)?
>
> $ python
> Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
> [GCC 3.3.3 (SuSE Linux)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>
>>>>x=open('/usr/share/dict/words').readlines()
>>>>len(x)
>
> 45378
>
>>>>import random
>>>>random.shuffle(x)
>>>>open("/tmp/A", "w").writelines(x)
>>>>random.shuffle(x)
>>>>open("/tmp/B", "w").writelines(x[:1000])
>>>>
>
> $ time sort A B B | uniq -u >/dev/null
>
> real 0m0.311s
> user 0m0.315s
> sys 0m0.008s
> $ time grep -Fvf B A >/dev/null
>
> real 0m0.067s
> user 0m0.064s
> sys 0m0.003s
>
> (Yes, I cheated by adding the F (for no regular expressions) flag :)
Also you only have 1000 entries in B!
Try it again with all entries in B also ;-)
Remember the original poster had 100K entries!
>>>and finally destroys original line
>>>order (should it be important).
>>
>>true
>
> That's our final agreement :)
Note the order is trivial to restore with a
"decorate-sort-undecorate" idiom.
--
Pádraig Brady - http://www.pixelbeat.org
--
More information about the Python-list
mailing list