HTML Parser chokes on WordHTML...

JanC usenet_spam at janc.invalid
Fri May 2 20:38:55 EDT 2003


Harald Massa <cpl.19.ghum at spamgourmet.com> wrote:

> So... is there any replacement for the HTMLParser from the Python
> standard library that can even eat Microsoft Word HTML?

Maybe try to process the Word pseudo-HTML with "HTML Tidy" before you feed 
it to HTMLParser?

<http://tidy.sourceforge.net/>
<http://tidy.sourceforge.net/docs/quickref.html#word-2000>
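
Something along these lines might work as a starting point. It is only a
sketch, assuming the "tidy" command-line program is installed and on your
PATH; the names tidy_word_html and TextCollector are made up for this
example, and the exact set of Tidy options you want may differ. It pipes the
Word output through Tidy with the word-2000 option switched on, then hands
the cleaned markup to HTMLParser:

```python
import shutil
import subprocess
from html.parser import HTMLParser


def tidy_word_html(raw_html, tidy_cmd="tidy"):
    """Pipe Word-exported HTML through the HTML Tidy command-line tool.

    Assumption: the executable is called "tidy" and is on PATH. If it is
    not found, the input is returned unchanged so the caller can still try
    to parse it.
    """
    if shutil.which(tidy_cmd) is None:
        return raw_html
    result = subprocess.run(
        [
            tidy_cmd,
            "-q",                      # suppress non-essential output
            "-utf8",                   # read and write UTF-8
            "--word-2000", "yes",      # strip Word 2000 specific markup
            "--show-warnings", "no",
            "--force-output", "yes",   # emit output even if errors remain
        ],
        input=raw_html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        check=False,  # Tidy exits with 1 on warnings and 2 on errors
    )
    cleaned = result.stdout.decode("utf-8", errors="replace")
    # Fall back to the original markup if Tidy produced nothing.
    return cleaned or raw_html


class TextCollector(HTMLParser):
    """Hypothetical HTMLParser subclass that just gathers text nodes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


if __name__ == "__main__":
    word_html = "<p class=MsoNormal>Hello <b>world</b></p>"
    parser = TextCollector()
    parser.feed(tidy_word_html(word_html))
    parser.close()
    print("".join(parser.chunks).strip())
```

Running Tidy as a separate process keeps the cleanup step independent of
the parser, so you can swap in a different parser later without touching it.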

You could wrap tidylib for use inside Python too:
<http://tidy.sourceforge.net/libintro.html>

-- 
JanC

"Be strict when sending and tolerant when receiving."
RFC 1958 - Architectural Principles of the Internet - section 3.9



