windows utf8 & lxml

Tue Dec 27 05:46:35 EST 2016

On Tue, 20 Dec 2016 10:53 pm, Sayth Renshaw wrote:

> content.read().encode('utf-8'), parser=utf8_parser)
> 
> However doing it in such a fashion returns this error:
> 
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0:
> invalid start byte

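That tells you that the XML file you have is not actually UTF-8. Note that
it is a *decode* error: it is raised when the file is read as text, before
your .encode('utf-8') call is ever reached.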

You have a file that begins with a byte 0xFF. That is invalid UTF-8. No
valid UTF-8 string contains the byte 0xFF.

https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
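
You can reproduce the same failure without the file. This is only a sketch:
the bytes in `data` are made up for the demonstration and are not taken from
your XML file.

```python
# Illustration only: hypothetical first bytes of a file, not your real data.
data = b"\xff\xfe<\x00"

try:
    data.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records which codec failed and where.
print(error)
```

The exception's `start` attribute gives the offset of the offending byte,
which is 0 here, matching "position 0" in your traceback.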

So you need to consider:

- Are you sure that the input file is intended to be UTF-8? How was it
created?

- Is the second byte 0xFE? If so, that suggests that you actually have
little-endian UTF-16 with a byte-order mark.
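
One way to answer the second question is to open the file in binary mode and
look at its first few bytes. The sketch below assumes nothing about your
file; `sniff_bom` and `read_head` are hypothetical helper names, and the BOM
constants come from the standard `codecs` module.

```python
import codecs

# Longer BOMs go first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(head):
    """Return a codec name suggested by a leading BOM, or None if no BOM."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def read_head(path, n=4):
    """Read the first n raw bytes of a file without decoding them."""
    with open(path, "rb") as f:
        return f.read(n)
```

For example, `sniff_bom(read_head("yourfile.xml"))` (with your own filename
in place of the made-up one) returns "utf-16" for a file starting FF FE. If
that is what you have, it is probably simpler to hand the parser the raw
bytes and let it work out the encoding than to decode and re-encode the
text yourself.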

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.