HTML Regular Expression

Michael Morrison borlak at home.com
Wed Jun 14 08:15:10 EDT 2000


This is my first regular expression, and it's working okay, but I was
wondering if there was a better way to do this.

I read in an HTML document using urllib.urlopen, which returns the document
complete with tags.  I don't want the tags.  So I made this regular
expression:

   reobj = re.compile(r"(<.*?>|&#\d.{1,3})")
   text = reobj.sub('', text)
   text = string.replace(text, '\012', '')

How does that look?  The <.*?> is for the tags, the &#\d.{1,3} is for the
&#183; characters and the like.  I was told all the &# codes are 3 digits,
and that is the only way I could get the regular expression to erase 'em.
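
For comparison, here is a minimal sketch of the same idea in current Python. It assumes that the "&# codes" are numeric character references of the form "&#183;" (digits followed by a semicolon), which is why it uses &#\d+; rather than &#\d.{1,3}; the latter matches one digit plus up to three arbitrary characters, so it can swallow ordinary text after a short code. The names MARKUP and strip_markup are made up for this illustration, and a regular expression like this is only a rough tool, not a full HTML parser.

```python
import re

# Assumption: numeric character references look like "&#183;" (digits, then ';').
# re.DOTALL lets "." match newlines, so a tag split across lines is still removed.
MARKUP = re.compile(r"<.*?>|&#\d+;", re.DOTALL)

def strip_markup(text):
    """Remove tags and numeric character references, then drop newlines."""
    text = MARKUP.sub("", text)
    # '\012' in the original snippet is the octal escape for '\n'.
    return text.replace("\n", "")

sample = "<p>one&#183;two</p>\n<b\nclass='x'>three</b>"
print(strip_markup(sample))
```

Compiling the pattern once and reusing it, as the original snippet does, is still a reasonable choice when the same substitution runs on many documents.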

One last question, a pretty newbie one :)  I run a simple loop to keep my
program going...

while going:
    do_whatever
    time.sleep(2)

I do this in PythonWin, but it freezes up in Windows, and I can't use Ctrl-D
or Ctrl-C or anything to stop it without killing it.  Is there a better
generic loop?
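
One possible shape for such a loop, sketched under the assumption that the freeze comes from the loop blocking the thread that also has to handle keyboard and window events: run the work in a background thread and stop it with a threading.Event. Whether this cures the freeze in PythonWin specifically is not verified here. The names run_loop, task, and calls are invented for the example, and task stands in for do_whatever.

```python
import threading

def run_loop(stop_event, do_whatever, interval=2.0):
    """Call do_whatever() repeatedly until stop_event is set."""
    while not stop_event.is_set():
        do_whatever()
        # Unlike time.sleep(), wait() returns early as soon as the event is set.
        stop_event.wait(interval)

stop = threading.Event()
calls = []

def task():
    # Hypothetical stand-in for do_whatever: record each call and
    # ask the loop to stop after the third one.
    calls.append(len(calls))
    if len(calls) == 3:
        stop.set()

# A short interval keeps the demonstration quick; the original loop used 2 seconds.
worker = threading.Thread(target=run_loop, args=(stop, task, 0.01), daemon=True)
worker.start()
worker.join(timeout=5)
print(calls)
```

In a real program, stop.set() would be called from wherever the user asks to quit, rather than from inside the task itself.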

Thanks for any help and comments!





