Why is regex so slow?

Tue Jun 18 14:10:16 EDT 2013

On 18.06.2013 19:20, Chris Angelico wrote:

> Yeah, I'd try that against 3.3 before opening a performance bug. Also,
> it's entirely possible that performance is majorly different in 3.x
> anyway, on account of strings being Unicode. Definitely merits another
> look imho.

Hmmm, at least Python 3.2 seems to have the same issue. I generated test
data with:

#!/usr/bin/python3
import random
random.seed(0)
f = open("error.log", "w")
for i in range(1500000):
	q = random.randint(0, 99)
	if q == 0:
		print("ENQUEUEING: /listen/ fhsduifhsd uifhuisd hfuisd hfuihds
iufhsd", file = f)
	else:
		print("fiosdjfoi sdmfio sdmfio msdiof msdoif msdoimf oisd mfoisdm f",
file = f)

Resulting file has a size of 91530018 and md5 of
2d20c3447a0b51a37d28126b8348f6c5 (just to make sure we're on the same
page because I'm not sure the PRNG is stable across Python versions).

Testing with:

#!/usr/bin/python3
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
#	if 'ENQ' not in line:
#		continue
	m = pattern.search(line)
	if m:
		count += 1
print(count)

The pre-check version is about 42% faster in my case (0.75 sec vs. 1.3
sec). Curious. This is Python 3.2.3 on Linux x86_64.

Regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1 at speranza.aioe.org>