windows utf8 & lxml

Sayth Renshaw flebber.crue at gmail.com
Tue Dec 20 06:53:42 EST 2016


Hi 

I have been trying to get a script that works on Linux Mint to run on Windows. The main blocker has been UTF-8 errors, most of which I have solved.

For the last remaining error, however, the solution appears to be to call .decode('windows-1252') on the data to correct a decoding error.

I am using lxml to read my content, and the parser object does not support decode. Are there any known ways to read with lxml and fix Unicode errors?

The key part of my script is:

        for content in roots:
            utf8_parser = etree.XMLParser(encoding='utf-8')
            fix_ascii = utf8_parser.decode('windows-1252')
            mytree = etree.fromstring(
                content.read().encode('utf-8'), parser=fix_ascii)

Without the added .decode, my code looks like this:

        for content in roots:
            utf8_parser = etree.XMLParser(encoding='utf-8')
            mytree = etree.fromstring(
                content.read().encode('utf-8'), parser=utf8_parser)

However, running it this way raises this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I found this Stack Overflow answer for it, http://stackoverflow.com/a/29217546/461887, but cannot seem to implement it with lxml.
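
A minimal sketch of one approach that may apply here, assuming the 0xff at position 0 is a UTF-16 byte order mark and that the files can be opened in binary mode. The sample document and the meeting/venue names are made up for illustration. It passes the raw bytes to lxml without an encoding override, so the parser can detect the encoding itself:

```python
import codecs
import os
import tempfile

from lxml import etree

# Hypothetical sample file: UTF-16 XML, so its first byte is 0xff (the BOM).
xml_text = '<?xml version="1.0" encoding="UTF-16"?><meeting venue="Caf\u00e9"/>'
raw = codecs.BOM_UTF16_LE + xml_text.encode('utf-16-le')

with tempfile.NamedTemporaryFile(suffix='.xml', delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Binary mode: Python does no text decoding here, so read()
    # cannot raise UnicodeDecodeError.
    with open(path, 'rb') as content:
        data = content.read()
    # No encoding= override on the parser; it works out the encoding
    # from the BOM and the XML declaration.
    mytree = etree.fromstring(data)
finally:
    os.remove(path)

venue = mytree.get('venue')
print(venue)
```

If the real files turn out to be in some other encoding, such as windows-1252 with no XML declaration, this sketch would not be enough on its own, and the encoding would have to be given to the parser explicitly.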

Ideas?

Sayth


