[2.5] Regex doesn't support MULTILINE?

Jay Loden python at jayloden.com
Sun Jul 22 01:18:22 EDT 2007


Gilles Ganault wrote:
> Problem is, when I add re.DOTALL, the search takes less than a second
> for a 500KB file... and about 1 min 30 s for a file that's 1MB, with both
> files holding similar contents.
> 
> Why such a huge difference in performance?
> 
> ========= Using Re =============
> import re
> import time
> 
> pattern = r"<span class=.?defaut.?>(\d+:\d+).*?</span>"
> 
> pages = ["500KB.html","1MB.html"]
> 
> #Veeeeeeeeeeery slow when parsing 1MB file !
> p = re.compile(pattern,re.IGNORECASE|re.MULTILINE|re.DOTALL)
> #p = re.compile(pattern,re.IGNORECASE|re.MULTILINE)
> 
> for page in pages:
> 	f = open(page, "r") 
> 	response = f.read() 
> 	f.close()
> 
> 	start = time.strftime("%H:%M:%S", time.localtime(time.time()))
> 	print "before findall @ " + start
> 	packed = p.findall(response)
> 	if packed:
> 		for item in packed:
> 			print item
> ===========================
> 

I don't know if it'll make a difference in performance, but since you're only saving the result of p.findall() to a variable in order to iterate over it, you might as well use p.finditer() instead. Note that finditer() yields match objects rather than strings, so you need group(1) to get the captured text:

	for item in p.finditer(response):
		print item.group(1)

At least then it can start printing as soon as it hits a match instead of needing to find all the matches first.
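
To make the difference concrete, here is a minimal sketch in Python 3 syntax (so print is a function, unlike in the script above). The sample string and the iter_times() helper are made up for illustration; the pattern is the one from the original script, written as a raw string.

```python
import re

# Pattern from the original script, as a raw string so "\d" is not
# treated as a string escape.
pattern = re.compile(
    r"<span class=.?defaut.?>(\d+:\d+).*?</span>",
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical sample input standing in for the contents of an HTML file.
sample = (
    '<span class="defaut">12:30 first\nentry</span>\n'
    "<span class='defaut'>08:05</span>\n"
)


def iter_times(text):
    """Yield the captured time from each match, one match at a time."""
    for match in pattern.finditer(text):
        # finditer() yields match objects; group(1) is the captured time.
        yield match.group(1)


# findall() builds the complete list before the loop can start.
all_times = pattern.findall(sample)

# The generator hands over each result as soon as it is found.
for found in iter_times(sample):
    print(found)
```

Both calls find the same matches; the finditer() version just hands them over one at a time instead of building the full list first. I wouldn't expect it to change the total search time by itself.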

-Jay


