NEWBIE: Removing HTML/JavaScript from a webpage

Miki Tebeka tebeka at cs.bgu.ac.il
Mon Jul 22 04:31:24 EDT 2002


> import urllib
> 
> response = urllib.urlopen('http://movies.go.com/cgi/movielistings/request.dll?ZIPSPECIFIC&zip_code=40004&date=07/20/2002')
> 
> resp = response.read()
> 
> This grabs the movie playtimes for my area. Now, my big question -- how
> do I remove all of the junk JavaScript and HTML? I stole this bit of
> code from somewhere:
> 
> import re
> split = re.sub("<[^>]*>","",resp)
> 
> But this only removes the HTML tags -- the JavaScript is still there, and I
> have no idea how to modify it so that it eliminates the script as well.

Have a look at http://starship.python.net/crew/tibs/python/html2text
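
If you would rather stay within the standard library, here is a rough sketch of one way to drop the script bodies along with the tags. It is not how html2text works, just an illustration. It assumes Python 3 and its html.parser module, and the names TextExtractor and strip_html are made up for this example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect page text, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._chunks = []      # text fragments seen outside those elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the removed markup.
        return re.sub(r"\s+", " ", "".join(self._chunks)).strip()


def strip_html(markup):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

With the code from the question this would be called as strip_html(resp). Note that in Python 3 response.read() returns bytes, so resp would have to be decoded to a string first.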

HTH
Miki


