HTML Parser
Darrell
news at dorb.com
Sun Dec 31 11:43:59 EST 2000
When parsing large documents I found "<[^>]*?>" to be much faster than "<.*?>".
>>> import re, time
>>> s="xxx<dog a='a'>yyyy"
>>> s1=s*1000000
>>> t1=time.time();res=re.findall("<[^>]*?>", s1);time.time()-t1
45.437000036239624
>>> t1=time.time();res=re.findall("<.*?>", s1);time.time()-t1
82.343999981880188
>>>
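For anyone who wants to rerun the comparison, here is a smaller-scale sketch. It checks that the two patterns return the same tags on this input and times each one. The helper name time_findall and the 10,000 repeat count are my own choices for illustration, not part of the session above, and absolute timings will vary with machine and Python version.

```python
import re
import time

def time_findall(pattern, text):
    """Hypothetical helper: return (matches, elapsed seconds) for re.findall."""
    start = time.perf_counter()
    matches = re.findall(pattern, text)
    return matches, time.perf_counter() - start

s = "xxx<dog a='a'>yyyy"
s1 = s * 10000  # assumption: far fewer repeats than the 1,000,000 used above

negated, t_negated = time_findall("<[^>]*?>", s1)
dotted, t_dotted = time_findall("<.*?>", s1)

# Both patterns find the same tags here; they differ only when a tag
# spans a newline, which '.' does not match by default.
print(negated == dotted, len(negated))
print("negated class: %.4fs  dot: %.4fs" % (t_negated, t_dotted))
```

The timings from a run this small are mostly noise; raise the repeat count to see a stable difference.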
Although, to my surprise, the patched version of _sre.pyd I use inverts these
results:
>>> t1=time.time();res=re.findall("<[^>]*?>", s1);time.time()-t1
75.577999949455261
>>> t1=time.time();res=re.findall("<.*?>", s1);time.time()-t1
49.296999931335449
The patch looks for ".*?" and optimizes out the recursion.
Both patterns run much faster under sre than under pre.
Here's pre:
>>> t1=time.time();res=re.findall("<[^>]*?>", s1);time.time()-t1
83.828999996185303
>>> t1=time.time();res=re.findall("<.*?>", s1);time.time()-t1
123.43700003623962
>>>
--Darrell
"Greg Jorgensen" <gregj at pobox.com> wrote:
> "Greg Jorgensen" <gregj at pobox.com> wrote:
>
> > rx = re.compile('(<.*?>)', re.MULTILINE)
>
> Oops -- that should be:
>
> rx = re.compile('(<.*?>)', re.DOTALL)
>
> That makes the . match the newlines. You need that because HTML tags can
> span lines.
>
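A minimal sketch of the difference Greg describes, using a made-up two-line tag (the html string is my own sample, not from the thread). re.MULTILINE only changes how ^ and $ behave, so the dot still stops at the newline and the opening tag is missed; re.DOTALL lets the dot cross it.

```python
import re

# Made-up sample: the opening tag spans two lines.
html = "<a\nhref='x'>link</a>"

multiline_rx = re.compile('(<.*?>)', re.MULTILINE)
dotall_rx = re.compile('(<.*?>)', re.DOTALL)

# MULTILINE: '.' cannot cross the newline, so only the closing tag matches.
print(multiline_rx.findall(html))   # ['</a>']

# DOTALL: '.' matches the newline too, so both tags are found.
print(dotall_rx.findall(html))      # ["<a\nhref='x'>", '</a>']
```

The negated-class pattern "<[^>]*?>" also matches across newlines without any flag, since [^>] matches every character except '>'.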