RE Module Performance

Mon Jul 15 06:06:06 EDT 2013

On 07/14/2013 02:17 PM, 88888 Dihedral wrote:
> On Saturday, July 13, 2013 1:37:46 PM UTC+8, Steven D'Aprano wrote:
>> On Fri, 12 Jul 2013 13:58:29 -0400, Devyn Collier Johnson wrote:
>>
>>> I plan to spend some time optimizing the re.py module for Unix systems.
>>> I would love to amp up my programs that use that module.
>>
>> In my experience, often the best way to optimize a regex is to not use it
>> at all.
>>
>> [steve at ando ~]$ python -m timeit -s "import re" \
>> > -s "data = 'a'*100+'b'" \
>> > "if re.search('b', data): pass"
>> 100000 loops, best of 3: 2.77 usec per loop
>>
>> [steve at ando ~]$ python -m timeit -s "data = 'a'*100+'b'" \
>> > "if 'b' in data: pass"
>> 1000000 loops, best of 3: 0.219 usec per loop
>>
>> In Python, we often use plain string operations instead of regex-based
>> solutions for basic tasks. Regexes are a 10 lb sledgehammer. Don't use
>> them for cracking peanuts.
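
The same comparison can be run from a script rather than the shell. This is
only a rough sketch using the standard timeit module; the function names
with_regex and with_in are made up for illustration, and the absolute numbers
will differ by machine and Python version.

```python
import re
import timeit

# Same test data as the shell example: 100 'a' characters followed by one 'b'.
data = "a" * 100 + "b"


def with_regex():
    # Regex search for a single literal character.
    return re.search("b", data) is not None


def with_in():
    # Plain substring test, no regex machinery involved.
    return "b" in data


# Take the best of three runs for each approach, as "python -m timeit" does.
loops = 10_000
regex_time = min(timeit.repeat(with_regex, number=loops, repeat=3))
in_time = min(timeit.repeat(with_in, number=loops, repeat=3))

print("re.search: %.3f usec per loop" % (regex_time / loops * 1e6))
print("in:        %.3f usec per loop" % (in_time / loops * 1e6))
```

Both functions return the same answer for this data, so the only thing being
compared is the cost of getting it.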
>>
>> -- 
>> Steven
> OK, let's talk about indexed search algorithms for
> a character stream or string, which can be buffered and
> indexed randomly for RW operations but is faster in sequential
> block RW operations after some pre-processing.
>
> This was solved a long time ago in the suffix array and
> suffix tree work and summarized in the famous BWT paper in 199X.
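
For anyone unfamiliar with the idea, here is a minimal sketch of a suffix
array search, assuming the text is indexed once and then searched many times.
The names build_suffix_array and find_all are hypothetical, the construction
is the naive sort-all-suffixes version rather than one of the fast algorithms,
and this is not how Python's built-in substring search works.

```python
def build_suffix_array(text):
    # Naive pre-processing step: sort the suffix start positions by the
    # suffix they point to. Simple, but slow and memory-hungry on large input.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, suffix_array, pattern):
    # Suffixes are sorted, so all suffixes that start with the pattern sit
    # in one contiguous run; two binary searches find the ends of that run.
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Return the match positions in text order.
    return sorted(suffix_array[first:lo])


text = "banana"
suffix_array = build_suffix_array(text)
print(find_all(text, suffix_array, "ana"))
```

The pre-processing only pays off when the same text is searched repeatedly;
for a one-off lookup the index costs more to build than the search saves.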
>
> Do we want volunteers to speed up
> search operations in the string module in Python?
It would be nice if someone could speed it up.