RE Module Performance

Sun Jul 14 14:17:06 EDT 2013

On Saturday, July 13, 2013 1:37:46 PM UTC+8, Steven D'Aprano wrote:
> On Fri, 12 Jul 2013 13:58:29 -0400, Devyn Collier Johnson wrote:
>
> > I plan to spend some time optimizing the re.py module for Unix systems.
> > I would love to amp up my programs that use that module.
>
> In my experience, often the best way to optimize a regex is to not use it
> at all.
>
> [steve at ando ~]$ python -m timeit -s "import re" \
> > -s "data = 'a'*100+'b'" \
> > "if re.search('b', data): pass"
> 100000 loops, best of 3: 2.77 usec per loop
>
> [steve at ando ~]$ python -m timeit -s "data = 'a'*100+'b'" \
> > "if 'b' in data: pass"
> 1000000 loops, best of 3: 0.219 usec per loop
>
> In Python, we often use plain string operations instead of regex-based
> solutions for basic tasks. Regexes are a 10lb sledge hammer. Don't use
> them for cracking peanuts.
>
> -- 
> Steven

OK, let's talk about indexed search algorithms for
a character stream or string that can be buffered and
indexed randomly for read/write operations, but is faster in
sequential block read/write operations after some pre-processing.

This was solved a long time ago with suffix arrays and
suffix trees, and summarized in the famous BWT paper in 199X.
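
To make the idea concrete, here is a rough sketch of what "pre-process
once, then search fast" could look like with a suffix array. The names
build_suffix_array and find_all are made up for illustration and are not
part of any Python module. The construction is the naive sort-based one,
not one of the linear-time algorithms from the literature, so treat it
as a toy.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, sorted
    lexicographically by suffix. Naive construction for illustration:
    it compares full suffix slices, so it is slow on large inputs."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, sa, pattern):
    """Return the sorted start positions of every occurrence of pattern
    in text, using binary search over the suffix array sa."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with pattern.
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)
print(find_all(text, sa, "ana"))
```

Once the array is built, each lookup needs only O(log n) slice
comparisons, which pays off when the same text is searched many times.
For a single search of a short string, the plain 'b' in data test from
Steven's timing above is still the simpler choice.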

Do we want volunteers to speed up 
search operations in the string module in Python?