NEWBIE: Removing HTML/JavaScript from a webpage
Miki Tebeka
tebeka at cs.bgu.ac.il
Mon Jul 22 04:31:24 EDT 2002
> import urllib
>
> response = urllib.urlopen('http://movies.go.com/cgi/movielistings/request.dll?ZIPSPECIFIC&zip_code=40004&date=07/20/2002')
>
> resp = response.read()
>
> This grabs the movie playtimes for my area. Now, my big question -- how
> do I remove all of the junk JavaScript and HTML? I stole this bit of
> code from somewhere:
>
> import re
> split = re.sub("<[^>]*>","",resp)
>
> But this only removes the HTML tags -- the JavaScript itself is still
> there, and I have no idea how to modify it so that it eliminates the
> script too.
Have a look at http://starship.python.net/crew/tibs/python/html2text
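
If you would rather stay with the re approach, one option is to delete
whole <script>...</script> (and <style>) blocks first, and only then strip
the remaining tags. Below is a rough sketch, not a real HTML parser: the
names strip_markup, SCRIPT_RE, COMMENT_RE and TAG_RE are made up for
illustration, and the patterns assume reasonably well-formed markup, so
expect odd pages to slip through.

```python
import re

# Hypothetical helper names; a regex-based sketch, not a real HTML parser.
# Match <script ...>...</script> and <style ...>...</style>, contents included.
SCRIPT_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>",
                       re.IGNORECASE | re.DOTALL)
# Match HTML comments, which may span several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Same tag pattern as in the quoted message.
TAG_RE = re.compile(r"<[^>]*>")


def strip_markup(html):
    """Return html with script/style blocks, comments and tags removed."""
    text = SCRIPT_RE.sub("", html)   # drop the script code first
    text = COMMENT_RE.sub("", text)  # then any leftover comments
    text = TAG_RE.sub("", text)      # finally the remaining tags
    return text


sample = ("<html><script type='text/javascript'>var x = 1 < 2;</script>"
          "<b>Shrek</b> 7:30</html>")
print(strip_markup(sample))
```

In your case you would call strip_markup(resp) instead of the single
re.sub. The order matters: if the tags are stripped first, the script
source is left behind as plain text with nothing around it to match on.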
HTH
Miki