Guessing the encoding from a BOM

Steven D'Aprano steve at pearwood.info
Thu Jan 16 01:45:38 EST 2014


On Thu, 16 Jan 2014 16:01:56 +1100, Chris Angelico wrote:

> On Thu, Jan 16, 2014 at 1:13 PM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>>         return 'utf_16'
>>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>>         return 'utf_32'
> 
> I'd swap the order of these two checks. If the file starts with
> FF FE 00 00, your code will guess that it's UTF-16 and that the text
> begins with U+0000.

Good catch, thank you.
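
For the archive, here is a minimal sketch of the reordered checks. The
function name guess_encoding_from_bom and the None fallback are
placeholders for illustration, not part of the code quoted above. Only
the two branches under discussion are shown.

```python
def guess_encoding_from_bom(sig):
    """Guess a codec name from the leading bytes of a file.

    Hypothetical helper: the name and the None fallback are assumptions
    made for this sketch.
    """
    # Test the 4-byte UTF-32 BOMs first. The UTF-32 little-endian BOM
    # (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE),
    # so checking UTF-16 first would shadow it.
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    # No BOM recognised by these two checks.
    return None
```

This is still a guess: a UTF-16 little-endian file whose first
character after the BOM really is U+0000 would now be reported as
UTF-32, but that should be far rarer than a genuine UTF-32 file.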


-- 
Steven


