Getting Unicode decode error using lxml.iterparse

Stefan Behnel stefan_ml at behnel.de
Wed May 23 13:24:35 EDT 2018


dieter wrote on 23.05.2018 at 08:25:
> If the encoding is not specified, "lxml" will try to determine it
> and finally defaults to "utf-8" (which seems to be the correct encoding
> for your case).

Being an XML parser, it does not do that. XML parsers are designed to
reject non-wellformed content, and that includes anything that cannot be
decoded.

In short, if no encoding is specified, then it's UTF-8, but if there is an
XML declaration that specifies an encoding, then the parser uses that encoding.

Here, the encoding is specified as UTF-8, so that's what the parser uses.
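
A minimal sketch of that rule, using the standard library's xml.etree.ElementTree.
The sample document and all variable names are made up for illustration and are
not taken from the original report; the exact error message depends on the
underlying parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content containing one non-ASCII character.
text = "<root>caf\u00e9</root>"

# 1) No XML declaration: the parser assumes UTF-8.
utf8_doc = text.encode("utf-8")
parsed_utf8 = ET.fromstring(utf8_doc).text

# 2) The declaration names ISO-8859-1, so the bytes are decoded as such.
latin1_doc = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    + text.encode("iso-8859-1")
)
parsed_latin1 = ET.fromstring(latin1_doc).text

# 3) The declaration says UTF-8 but the bytes are really Latin-1:
#    the parser does not guess, it rejects the document.
bad_doc = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    + text.encode("iso-8859-1")
)
rejected = False
try:
    ET.fromstring(bad_doc)
except ET.ParseError as exc:
    rejected = True
    print("rejected:", exc)

print(parsed_utf8 == parsed_latin1, rejected)
```

The first two documents yield the same text because the bytes match the encoding
in effect; the third fails because they do not.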

Note, however, that the library that the OP uses is not lxml but xml.etree,
i.e. the ElementTree XML support in the standard library.

Stefan



