HTML purifier using BeautifulSoup?

Jonathan Clark j-clark at lineone.net
Fri Jan 7 13:10:30 EST 2005


Dan Stromberg wrote:
> Has anyone tried to construct an HTML janitor script using BeautifulSoup?
>
> My situation:
>
> I'm trying to convert a series of web pages from .html to palmdoc format,
> using plucker, which is written in python.  The plucker project suggests
> passing html through "tidy", to get well-formed html for plucker to work
> with.
>
> However, some of the pages I want to convert are so bad that even tidy
> pukes on them.
>
> I was thinking that BeautifulSoup might be more tolerant of really bad
> html...  Which led me to the question this article started out with. :)
>
> Thanks!

I have used BeautifulSoup for screen scraping, pulling HTML into
structured form (as XML). Is that similar to a janitor script? I
used it because tidy was puking on some HTML. BeautifulSoup has been
excellent.
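
For what it's worth, here is a minimal sketch of that kind of clean-up
pass. It assumes the current bs4 packaging of BeautifulSoup, which
postdates this thread, so the import and constructor arguments are not
what I ran at the time. The purify name and the sample markup are
made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: the modern bs4 package


def purify(raw_html: str) -> str:
    """Parse possibly broken HTML and re-serialize it with balanced tags.

    Hypothetical helper for illustration, not part of BeautifulSoup.
    """
    # "html.parser" selects the builder backed by Python's standard
    # library, so no extra parser needs to be installed.
    soup = BeautifulSoup(raw_html, "html.parser")
    # Serializing the tree emits a closing tag for every element that
    # the input left open.
    return str(soup)


if __name__ == "__main__":
    messy = "<p>Unclosed <b>bold text<p>Second paragraph"
    print(purify(messy))
```

The cleaned string could then be handed to plucker, or to tidy for a
second pass. Different parser backends repair the same broken markup
in different ways, so it is worth eyeballing the output on a few of
the worst pages first.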



