How do I enter/receive webpage information?

John J. Lee jjl at pobox.com
Sat Feb 5 18:10:08 EST 2005


Jorgen Grahn <jgrahn-nntq at algonet.se> writes:
[...]
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
>   receive. This class knew how to pull the information from a HTML document,
>   provided it looked as I expected it to.  Very tedious work. It can be easier
>   and safer to just use module re in some cases.
[...]
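
For reference, the two approaches quoted above might look roughly like
the sketch below. Treat it as an illustration, not Jorgen's actual code:
sgmllib no longer exists in Python 3, so this assumes the standard
library's html.parser.HTMLParser as a stand-in, and the names
LinkExtractor, extract_links, HREF_RE and extract_hrefs_with_re are made
up for the example.

```python
import re
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for each <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> we are currently inside, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; no href leaves None
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parser-based approach: returns a list of (href, link text) pairs."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Regex-based approach: shorter, but it only handles double-quoted href
# attributes and knows nothing about the structure of the document.
HREF_RE = re.compile(r'<a\s[^>]*href="([^"]*)"', re.IGNORECASE)


def extract_hrefs_with_re(html):
    """Return the href values of <a> tags found by a simple pattern match."""
    return HREF_RE.findall(html)
```

The parser version needs one such class per kind of page, which is where
the tedium comes from; the regex version is quicker to write but breaks
as soon as the markup strays from the pattern.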

BeautifulSoup is often recommended for parsing messy HTML (I have never
tried it myself).

Remember that HTML Tidy and its offshoots (e.g. tidylib, mxTidy) are
available for cleaning up horrid HTML while you scrape, too.

Alternatively, some people swear by automating Internet Explorer;
other people would rather be hit on the head with a ball-peen hammer
(not only the MS-haters)...


John



More information about the Python-list mailing list