HTML Parser

Sun Dec 31 11:43:59 EST 2000

When parsing large documents, I found the negated character class "<[^>]*?>" to be much faster than "<.*?>".

>>> import re, time
>>> s="xxx<dog a='a'>yyyy"
>>> s1=s*1000000
>>> t1=time.time();res=re.findall("<[^>]*?>", s1);time.time()-t1
45.437000036239624
>>> t1=time.time();res=re.findall("<.*?>", s1);time.time()-t1
82.343999981880188
>>>

To my surprise, though, the patched version of _sre.pyd that I use inverts
these results.

>>> t1=time.time();res=re.findall("<[^>]*?>", s1);time.time()-t1
75.577999949455261
>>> t1=time.time();res=re.findall("<.*?>", s1);time.time()-t1
49.296999931335449

The patch looks for ".*?" and optimizes out the recursion.

Both patterns run much faster with sre than with pre.
Here are the timings with pre:
>>> t1=time.time();res=re.findall("<[^>]*?>", s1);time.time()-t1
83.828999996185303
>>> t1=time.time();res=re.findall("<.*?>", s1);time.time()-t1
123.43700003623962
>>>
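
For anyone who wants to repeat the comparison, here is a minimal sketch of the same measurement as a script. The helper name time_findall and the smaller repeat count are my own assumptions to keep the run short, and time.perf_counter stands in for the time.time calls used above. Absolute numbers will differ by machine and Python version.

```python
import re
import time

def time_findall(pattern, text):
    """Return (elapsed_seconds, match_count) for one re.findall call."""
    start = time.perf_counter()
    matches = re.findall(pattern, text)
    return time.perf_counter() - start, len(matches)

# Same test string as in the sessions above, repeated fewer times
# (100000 is an arbitrary choice, not the original 1000000).
s = "xxx<dog a='a'>yyyy"
text = s * 100000

for pattern in ("<[^>]*?>", "<.*?>"):
    elapsed, count = time_findall(pattern, text)
    print("%-10s %d matches in %.3f s" % (pattern, count, elapsed))
```

Both patterns should report the same number of matches on this input; only the elapsed time is expected to differ.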

--Darrell

"Greg Jorgensen" <gregj at pobox.com> wrote:
> "Greg Jorgensen" <gregj at pobox.com> wrote:
>
> > rx = re.compile('(<.*?>)', re.MULTILINE)
>
> Oops -- that should be:
>
>     rx = re.compile('(<.*?>)', re.DOTALL)
>
> That makes the . match the newlines. You need that because HTML tags can
> span lines.
>
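
A minimal sketch of the flag difference described above. The two-line html string is a made-up example, not taken from the original thread. It also shows that the negated character class matches across newlines without needing any flag.

```python
import re

# Hypothetical input: an opening tag that spans two lines.
html = "<a\n href='x'>link</a>"

# MULTILINE only changes how ^ and $ behave; "." still stops at a newline,
# so the opening tag that spans lines is not matched.
with_multiline = re.findall("(<.*?>)", html, re.MULTILINE)

# DOTALL lets "." match newlines, so both tags are found.
with_dotall = re.findall("(<.*?>)", html, re.DOTALL)

# [^>] already matches newlines, so no flag is required.
with_negated_class = re.findall("(<[^>]*?>)", html)

print(with_multiline)
print(with_dotall)
print(with_negated_class)
```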