encoding in lxml

Mon Nov 3 13:50:37 EST 2008

Hi Mike,

> I read an HTML document from a third-party site. It is supposed to be
> in UTF-8, but unfortunately from time to time it's not.

There will be host of more lightweight solutions, but you can opt
to sanizite incominhg HTML with HTML Tidy (python binding available).

It will replace invalid UTF-8 bytes with U+FFFD. It will not
guess a better encoding to use.

If you are sure you don't have HTML sloppiness to correct but only
the
occasional wrong byte, even decoding (with fallback) and encoding
using
the standard codec package will do.

Regards,
Peter