How do I enter/receive webpage information?
John J. Lee
jjl at pobox.com
Sat Feb 5 18:10:08 EST 2005
Jorgen Grahn <jgrahn-nntq at algonet.se> writes:
[...]
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
> receive. This class knew how to pull the information from an HTML document,
> provided it looked as I expected it to. Very tedious work. It can be easier
> and safer to just use module re in some cases.
[...]
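For anyone reading this later, the parser-subclass approach quoted above might look roughly like the sketch below. It is only an illustration under some assumptions: sgmllib was removed in Python 3, so the standard library's html.parser.HTMLParser stands in for it here, and the LinkExtractor class, the helper functions and SAMPLE_PAGE are made-up names, not anything from the original poster's code. The regular-expression variant at the end shows the "just use module re" shortcut for comparison.

```python
from html.parser import HTMLParser
import re


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element on a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical sample page, deliberately a little sloppy (unclosed <p>).
SAMPLE_PAGE = """
<html><body>
<p>See <a href="/docs">the docs</a> and <a href='/faq'>the <b>FAQ</b></a>.
</body></html>
"""


def extract_links(html):
    """Parser-based extraction: works from the tag structure of the page."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def extract_hrefs_with_re(html):
    """Regex shortcut: quick to write, but only matches the quoting it expects."""
    return re.findall(r'<a\s+[^>]*href=["\']([^"\']+)["\']', html, re.IGNORECASE)


if __name__ == "__main__":
    print(extract_links(SAMPLE_PAGE))
    print(extract_hrefs_with_re(SAMPLE_PAGE))
```

The parser version copes with the nested <b> inside the second link, at the cost of more code per kind of page. The regex version is shorter but breaks as soon as the markup strays from the pattern (an unquoted href, for instance).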
BeautifulSoup is often recommended (never tried it myself).
Remember that HTML Tidy and its offshoots (e.g. tidylib, mxTidy) are
available for cleaning up horrid HTML while-u-scrape, too.
Alternatively, some people swear by automating Internet Explorer;
other people would rather be hit on the head with a ball-peen hammer
(not only the MS-haters)...
John