Readlines returns non ASCII character

MRAB python at mrabarnett.plus.com
Wed Sep 23 22:02:21 EDT 2015


On 2015-09-24 02:37, Ian Kelly wrote:
> On Wed, Sep 23, 2015 at 6:09 PM, MRAB <python at mrabarnett.plus.com> wrote:
>> On 2015-09-24 00:51, paul.hermeneutic at gmail.com wrote:
>>>
>>>   If this starts at the beginning of the file, then it indicates that
>>> the file is UTF-16 (LE).
>>>
>>> UTF-8          EF BB BF       239 187 191
>>> UTF-16 (BE)    FE FF          254 255
>>> UTF-16 (LE)    FF FE          255 254
>>> UTF-32 (BE)    00 00 FE FF    0 0 254 255
>>> UTF-32 (LE)    FF FE 00 00    255 254 0 0
>>>
>> The "signature" EF BB BF indicates the encoding called "utf-8-sig" by
>> Python. It is commonly written by Windows programs.
>>
>> If the file doesn't start with any of these, then it could be using any
>> encoding (except UTF-16 or UTF-32).
>
> Yes, but what does it mean when the signature is 00 FF 00 FE 00 FF and
> occurs not at the beginning but repeatedly throughout the file, as
> appears in the OP's case?
>
> At least, I'm assuming that the high-order bytes are 00 based on what
> the OP posted. I wouldn't be surprised though if they're just being
> mangled by the terminal, if it happens to be a certain one that will
> not be named but uses CP 1252.
>
Yes, a byte-string literal or a hex dump of, say, the first 256 bytes
would've been better.
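
For anyone wanting to produce that kind of dump, here is a minimal sketch, assuming Python 3 (where iterating over bytes yields ints). The helper names sniff_bom, hex_dump and inspect_file are made up for illustration, and the labels returned are just the ones from the table quoted above, not codec names. It only looks for a signature at the very start of the data, so it says nothing about byte sequences repeated later in the file.

```python
import codecs

# Check the 4-byte signatures first: the UTF-32 (LE) BOM (FF FE 00 00)
# begins with the UTF-16 (LE) BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 (BE)"),
    (codecs.BOM_UTF32_LE, "UTF-32 (LE)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 (BE)"),
    (codecs.BOM_UTF16_LE, "UTF-16 (LE)"),
]


def sniff_bom(head):
    """Return the table label for a BOM at the start of head, or None."""
    for bom, label in BOMS:
        if head.startswith(bom):
            return label
    return None


def hex_dump(data, width=16):
    """Format bytes as lines of an offset followed by hex byte values."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("%02X" % b for b in chunk)
        lines.append("%08X  %s" % (offset, hex_part))
    return "\n".join(lines)


def inspect_file(path, count=256):
    """Print a bytes literal, a hex dump and any BOM for the first bytes."""
    # Binary mode, so that no decoding (and no terminal) mangles the bytes.
    with open(path, "rb") as f:
        head = f.read(count)
    print(repr(head))
    print(hex_dump(head))
    print("BOM:", sniff_bom(head))
```

Calling inspect_file() with the path of the problem file and pasting the output would show exactly which bytes are in the file, independent of whatever encoding the terminal uses.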



