encoding in lxml

Mon Nov 3 06:43:26 EST 2008

Hey,

I have a problem with character encoding in lxml. Here's how it goes:

I read an HTML document from a third-party site. It is supposed to be
in UTF-8, but unfortunately from time to time it's not. I parse the
document like this:

from lxml.etree import HTML
html_doc = HTML(string_with_document)

Then I retrieve some info from the document with XPath:

xpath_nodes = html_doc.xpath('/html/body/something')

The xpath_nodes list is guaranteed to contain exactly one element,
so I read its content:

xpath_nodes[0].text

And I get an exception here. It is raised by the text property of the
Element object, because the text contains a byte sequence that is not
valid UTF-8. lxml seems to use strict decoding, and I can't find a way
to make it ignore the error. Is there anything I can do to retrieve
the text without getting an exception?
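
One possible workaround, sketched here under assumptions rather than
confirmed as the right approach: decode the bytes by hand with the
"replace" error handler before lxml ever sees them, so invalid
sequences become U+FFFD instead of raising later. The names
decode_leniently and first_text are hypothetical helpers, and the
sketch assumes the document arrives as a byte string that is meant to
be UTF-8.

```python
def decode_leniently(raw_bytes, encoding="utf-8"):
    """Decode bytes, replacing invalid sequences with U+FFFD instead of raising."""
    return raw_bytes.decode(encoding, "replace")


def first_text(raw_bytes, xpath_expr):
    """Parse leniently decoded HTML and return the text of the first XPath match."""
    # Imported here so decode_leniently() stays usable without lxml installed.
    from lxml import etree

    # Passing already-decoded text means lxml does no byte decoding of its own.
    html_doc = etree.HTML(decode_leniently(raw_bytes))
    nodes = html_doc.xpath(xpath_expr)
    return nodes[0].text if nodes else None
```

With this, first_text(string_with_document, '/html/body/something')
would stand in for the three steps above. The trade-off is that broken
characters are silently replaced rather than reported.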

Regards,

Mike