HTML purifier using BeautifulSoup?

Dan Stromberg strombrg at dcs.nac.uci.edu
Tue Dec 21 13:10:37 EST 2004


Has anyone tried to construct an HTML janitor script using BeautifulSoup?

My situation:

I'm trying to convert a series of web pages from .html to PalmDoc format
using Plucker, which is written in Python.  The Plucker project suggests
passing HTML through "tidy" to get well-formed HTML for Plucker to work
with.

However, some of the pages I want to convert are so bad that even tidy
pukes on them.

I was thinking that BeautifulSoup might be more tolerant of really bad
HTML, which led me to the question this article started out with.  :)
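
For concreteness, here is a rough sketch of the kind of janitor I have in
mind. It assumes the present-day bs4 package and its "html.parser" backend,
which may not match the BeautifulSoup release you have installed. The
function name purify_html is made up for illustration, and stripping
script and style elements is my own guess at what a PalmDoc conversion
would not need, not something Plucker asks for.

```python
from bs4 import BeautifulSoup


def purify_html(raw_html: str) -> str:
    """Parse possibly malformed HTML and re-serialize it with tags closed."""
    # Assumption: the stdlib-backed "html.parser" builder is lenient enough;
    # other backends (lxml, html5lib) repair broken markup differently.
    soup = BeautifulSoup(raw_html, "html.parser")

    # Assumption: scripts and stylesheets are useless in a PalmDoc, so drop them.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    # Serializing the tree emits a closing tag for every element left open.
    return str(soup)


if __name__ == "__main__":
    messy = "<p>Unclosed paragraph<b>bold text<script>alert(1)</script>"
    print(purify_html(messy))
```

The output of something like this could then be fed to tidy or straight to
Plucker. Whether the repaired markup is actually what the page author
intended is a separate question; the parser only guarantees that the tags
it emits are balanced.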

Thanks!





More information about the Python-list mailing list