clean up html document created by Word

bearophileHUGS at lycos.com
Fri Mar 30 13:52:23 EDT 2007


jd:
> I am looking for python code (working or sample code) that can take an
> html document created by Microsoft Word and clean it up (if you've
> never had to look at a Word-generated html document, consider yourself
> lucky ;-)  Alternatively, if you know of a non-python solution, I'd
> like to hear about it.

It's not an easy job, and it may require some manual editing, because
that HTML is the worst I have seen. You can use Tidy (there is a GUI
for it too): use its error messages to remove the offending parts by
hand until Tidy is able to digest the file and return cleaned-up
HTML. But then you have only just started, because the result still
needs more processing.
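
As a rough idea of the kind of extra processing I mean, here is a
minimal Python sketch that strips the most common Word clutter with
regular expressions. The function name strip_word_markup and the
choice of patterns are my own assumptions, not part of Tidy or of any
library. It is deliberately aggressive (it drops every style
attribute) and is only a starting point, not a real HTML parser.

```python
import re

def strip_word_markup(html):
    """Remove some common Word-specific clutter from an HTML string.

    Hypothetical helper: the patterns below are assumptions about what
    Word typically emits and will not catch everything.
    """
    # Hidden conditional comments: <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html, flags=re.S | re.I)
    # Visible conditional markers: <![if !supportLists]> and <![endif]>
    # (only the markers are removed, the text between them is kept)
    html = re.sub(r"<!\[(?:if[^\]]*|endif)\]>", "", html, flags=re.I)
    # Office namespace tags such as <o:p>, </o:p>, <v:shape ...>
    html = re.sub(r"</?[a-z]+:[^>]*>", "", html, flags=re.I)
    # class=MsoNormal and similar attributes, quoted or unquoted
    html = re.sub(r"\s+class=(?:\"Mso[^\"]*\"|'Mso[^']*'|Mso\w+)", "",
                  html, flags=re.I)
    # Every quoted style attribute (aggressive: loses all inline styling)
    html = re.sub(r"\s+style=(?:\"[^\"]*\"|'[^']*')", "", html, flags=re.I)
    return html

sample = ('<p class=MsoNormal style="margin:0cm"><span lang=EN-US>Hello'
          '<o:p></o:p></span></p>'
          '<!--[if gte mso 9]><xml>junk</xml><![endif]-->')
print(strip_word_markup(sample))
```

Leftover span tags and lang attributes are not touched here. Running
the output through Tidy afterwards is still a good idea.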

One solution is to avoid creating the HTML with Word in the first
place, or to use something more like Word 97 to create it. Dreamweaver
can also help with the trashy HTML from Word 2000+, but usually not
enough.

If the structure of the HTML document is simple enough, and assuming
you are using Windows, you can open it with Word, save it as RTF,
reopen it with WordPad, save it again to remove some trash, and then
use something else (like Word 97, or maybe even Arachnophilia, etc.) to
convert it to HTML. In general, though, I've never found a really good
way to convert RTF to very good HTML.

Bye,
bearophile



