[Tutor] Regex troubles

Daniel Yoo dyoo@hkn.eecs.berkeley.edu
Sat, 21 Apr 2001 02:52:33 -0700 (PDT)


On Sat, 21 Apr 2001, JRicker wrote:

> Ok I'm at wits end here and would love to hear some ideas on what I'm
> doing wrong here.
> 
> I'm trying to match all occurrences of this text in a web page:
> 
> <font color="">Some Text
> </font>

As a small note: your problem might be a little easier if you use the
htmllib module: it contains a parser that understands HTML
documents.  You can use htmllib to focus on certain tags, like font, and
have it do something in response to those tags.  There's some
documentation on htmllib here:

    http://python.org/doc/current/lib/module-htmllib.html

and if you want, we could cook up a quick example that uses htmllib.
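
Here is roughly what such a quick example could look like.  One loud
assumption: htmllib was removed in Python 3, so this sketch uses its
standard-library successor, html.parser, and Python 3 syntax.  The names
FontTextCollector, font_blocks_with_keywords and sample_page are made up
for the illustration, and the sample page is invented, not taken from the
original question.  Treat it as a starting point rather than a finished
tool:

```python
import re
from html.parser import HTMLParser


class FontTextCollector(HTMLParser):
    """Collect the text inside each <font> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text of each <font> element
        self._parts = None    # text pieces of the element being read, or None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; start collecting at every <font>.
        if tag == "font":
            self._parts = []

    def handle_data(self, data):
        # Keep text only while we are inside a <font> element.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "font" and self._parts is not None:
            self.blocks.append("".join(self._parts))
            self._parts = None


def font_blocks_with_keywords(page, keywords=("Foobar", "Some")):
    """Return the text of the <font> elements that mention any keyword."""
    collector = FontTextCollector()
    collector.feed(page)
    collector.close()
    # Escape each keyword so it is matched literally, ignoring case.
    keyword_re = re.compile("|".join(map(re.escape, keywords)), re.I)
    return [text for text in collector.blocks if keyword_re.search(text)]


# Invented sample input for the demonstration.
sample_page = (
    '<font color="">Some Text\n</font>\n'
    '<font color="">Other stuff\n</font>\n'
    '<font color="">A foobar line\n</font>\n'
)
matching_blocks = font_blocks_with_keywords(sample_page)
print(matching_blocks)
```

The sketch looks at every font tag regardless of its color attribute and
does not handle nested font tags; both would need extra checks in
handle_starttag() if the real pages call for them.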



> Some code:
> 
> keywords = "[Foobar|Some]"
> keyword_re = re.compile("<font color=\"\">(.*?" + keywords +
> ".*?)</font>", re.I | re.DOTALL)
> for x in re.findall(keyword_re, apage):
>         print x
> 
> where apage is a web page .read() in.  Now this is returning everything
> between each appearance of <font color=""></font> whether the keywords
> match or not.


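Strange; I'm not seeing anything wrong with the loop itself... wait!
Ah!

> keywords = "[Foobar|Some]"


There's the bug!  Square brackets build a character class, not a group of
alternatives: "[Foobar|Some]" matches any single one of the characters F,
o, b, a, r, |, S, m, or e (upper or lower case, because of re.I).  Almost
any text contains one of those letters, so almost every font block
matches.  You meant to write:

    keywords = "(?:Foobar|Some)"

instead.  The parentheses group the alternatives, and the "?:" makes the
group non-capturing, so findall() still returns plain strings rather than
tuples.


> for x in re.findall(keyword_re, apage):
>         print x

This part works as written, although the more direct spelling is:

    for x in keyword_re.findall(apage):
        print x

When we "compile" a regular expression, what we get back is a
regular expression object:

    http://python.org/doc/current/lib/re-objects.html

and that object has a findall() method of its own.  re.findall() also
accepts an already-compiled object, so the two spellings return the same
results.




> Something else that struck me as odd.  I tried making keywords a single
> word to search for (ie keywords = "Some") and my script crashed giving
> me an error of maximum recursion level reached.  Any ideas what caused
> this?

Most likely, this comes from the ".*?" in front of the keyword.  With
re.DOTALL, "." also matches newlines, so when a font block doesn't contain
"Some", the ".*?" keeps expanding past the closing </font> and on through
the page looking for it.  As far as I know, the current regular expression
engine recurses as it expands a pattern like that, and on a long page it
can hit its recursion limit.  The character class never triggered this,
because it matched within the first few characters of every block.  If the
error still shows up after the correction above, match each font block on
its own first, and then search its text for the keywords.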
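
To experiment with the pieces of this pattern, here is a small sketch in
Python 3 syntax.  The page fragment and the variable names are invented for
the illustration.  It shows three things: a character class matching on a
single letter where a grouped alternation does not, re.findall() taking an
already-compiled pattern, and a two-step filter that matches each font
block first and only then searches its text for the keywords, which keeps
".*?" from wandering past a closing </font>:

```python
import re

# Hypothetical page fragment, made up for this illustration.
apage = (
    '<font color="">Some Text\n</font>\n'
    '<font color="">Other stuff\n</font>\n'
    '<font color="">A foobar line\n</font>\n'
)

# 1. A character class matches ONE character from the set, so "Other stuff"
#    matches on its first letter; the grouped alternation needs a whole word.
class_hit = re.search("[Foobar|Some]", "Other stuff", re.I)
group_hit = re.search("(?:Foobar|Some)", "Other stuff", re.I)

# 2. re.findall() accepts a compiled pattern as well as a pattern string.
block_re = re.compile('<font color="">(.*?)</font>', re.I | re.DOTALL)
all_blocks = block_re.findall(apage)
same_result = re.findall(block_re, apage) == all_blocks

# 3. Two-step filter: match each block first, then search inside it.
keyword_re = re.compile("(?:Foobar|Some)", re.I)
wanted = [text for text in all_blocks if keyword_re.search(text)]
print(wanted)
```

Because the block pattern contains no keyword, each ".*?" stops at the
nearest </font>, and the keyword test only ever sees the text of one block.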


If you have time, play around with htmllib too; it's not too bad, and it
simplifies a lot of the work necessary to parse out HTML files.