[Python-ideas] Fix default encodings on Windows

Steven D'Aprano steve at pearwood.info
Wed Aug 10 23:14:04 EDT 2016


On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote:

> On 10Aug2016 1431, Chris Angelico wrote:
> >>* make the default open() encoding check for a BOM or else use utf-8
> >
> >-0.5. Is there any precedent for this kind of data-based detection
> >being the default?

There is precedent: the Python interpreter will accept a BOM instead of 
an encoding cookie when importing .py files.
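You can see this for yourself; here's an illustrative snippet of mine (the module name 'bom_demo' is made up) showing that a .py file whose only encoding declaration is a UTF-8 BOM imports cleanly:

```python
# A .py file that starts with the UTF-8 BOM, and has no coding cookie,
# is still decoded as UTF-8 by the import machinery.
import importlib.util
import os
import tempfile

source = b'\xef\xbb\xbf' + 'GREETING = "caf\u00e9"\n'.encode('utf-8')
with tempfile.NamedTemporaryFile('wb', suffix='.py', delete=False) as f:
    f.write(source)
    path = f.name

spec = importlib.util.spec_from_file_location('bom_demo', path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)  # the tokenizer treats the BOM as utf-8
print(module.GREETING)           # café
os.remove(path)
```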


[Chris]
> >An explicit "utf-8-sig" could do a full detection,
> >but even then it's not perfect - how do you distinguish UTF-32LE from
> >UTF-16LE that starts with U+0000? 

BOMs are a heuristic, nothing more. Arbitrary files can start with 
anything, so of course the detection can guess wrong. But then if I 
dumped a bunch of arbitrary Unicode code points in your lap and asked 
you to guess the language, you would likely get it wrong too :-)

[Chris]
> >Do you say "UTF-32 is rare so we'll
> >assume UTF-16", or do you say "files starting U+0000 are rare, so
> >we'll assume UTF-32"?

The way I have done auto-detection based on BOMs is you start by reading 
four bytes from the file in binary mode. (If there are fewer than four 
bytes, it cannot be a text file with a BOM.) Compare those first four 
bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second* 
(otherwise UTF-16 will shadow UTF-32). Note that each of UTF-32 and 
UTF-16 has two BOMs (big-endian and little-endian). Then check for 
UTF-8, and if you're
really keen, UTF-7 and UTF-1.

def bom2enc(bom, default=None):
    """Return an encoding name from the first four bytes of a file."""
    if bom.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif bom.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif bom.startswith(b'\xEF\xBB\xBF'):
        return 'utf_8_sig'
    elif bom.startswith(b'\x2B\x2F\x76'):
        # UTF-7: the fourth byte must be one of '+', '/', '8', '9'.
        if len(bom) == 4 and bom[3] in b'\x2B\x2F\x38\x39':
            return 'utf_7'
    elif bom.startswith(b'\xF7\x64\x4C'):
        return 'utf_1'
    # No (complete) signature recognised.
    if default is None:
        raise ValueError('no recognisable BOM signature')
    return default
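In practice I'd drive the detection straight off the BOM constants the codecs module already provides. A self-contained sketch (the name `sniff` is mine):

```python
import codecs

def sniff(path, default='utf_8'):
    """Open *path* for text reading, guessing the codec from a leading BOM.

    UTF-32 is tested before UTF-16; otherwise the UTF-16-LE BOM (FF FE)
    would shadow the UTF-32-LE one (FF FE 00 00).
    """
    with open(path, 'rb') as f:
        head = f.read(4)
    signatures = [
        (codecs.BOM_UTF32_BE, 'utf_32'),   # 00 00 FE FF
        (codecs.BOM_UTF32_LE, 'utf_32'),   # FF FE 00 00
        (codecs.BOM_UTF16_BE, 'utf_16'),   # FE FF
        (codecs.BOM_UTF16_LE, 'utf_16'),   # FF FE
        (codecs.BOM_UTF8, 'utf_8_sig'),    # EF BB BF
    ]
    for bom, enc in signatures:
        if head.startswith(bom):
            # The utf_16/utf_32/utf_8_sig codecs consume the BOM themselves.
            return open(path, encoding=enc)
    return open(path, encoding=default)
```

The utf_16 and utf_32 codecs work out the endianness from the BOM on their own, so the table doesn't need separate names for the LE and BE variants.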



[Steve Dower]
> The BOM exists solely for data-based detection, and the UTF-8 BOM is 
> different from the UTF-16 and UTF-32 ones. So we either find an exact 
> BOM (which IIRC decodes as a no-op spacing character, though I have a 
> feeling some version of Unicode redefined it exclusively for being the 
> marker) or we use utf-8.

The Byte Order Mark is always U+FEFF encoded into whatever bytes your 
encoding uses. You should never use U+FEFF except as a BOM, but of 
course arbitrary Unicode strings might include it in the middle of the 
string Just Because. In that case, it may be interpreted as a legacy 
"ZERO WIDTH NON-BREAKING SPACE" character. But new content should never 
do that: you should use U+2060 "WORD JOINER" instead, and treat a U+FEFF 
inside the body of your file or string as an unsupported character.

http://www.unicode.org/faq/utf_bom.html#BOM
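The point that the BOM is just U+FEFF run through the encoder is easy to verify at the interactive prompt:

```python
# U+FEFF encoded in each scheme yields exactly that scheme's BOM bytes.
assert '\ufeff'.encode('utf-8') == b'\xef\xbb\xbf'
assert '\ufeff'.encode('utf-16-be') == b'\xfe\xff'
assert '\ufeff'.encode('utf-16-le') == b'\xff\xfe'
assert '\ufeff'.encode('utf-32-be') == b'\x00\x00\xfe\xff'
assert '\ufeff'.encode('utf-32-le') == b'\xff\xfe\x00\x00'
```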


[Steve]
> But the main reason for detecting the BOM is that currently opening 
> files with 'utf-8' does not skip the BOM if it exists. I'd be quite 
> happy with changing the default encoding to:
> 
> * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
> * utf-8 when writing (so the BOM is *not* written)

Sounds reasonable to me.

Rather than hard-coding that behaviour, can we have a new encoding that 
does that? "utf-8-readsig" perhaps.


[Steve]
> This provides the best compatibility when reading/writing files without 
> making any guesses. We could reasonably extend this to read utf-16 and 
> utf-32 if they have a BOM, but that's an extension and not necessary for 
> the main change.

The use of a BOM is always a guess :-) Maybe I just happen to have a 
Latin1 file that starts with "ï»¿", or a Mac Roman file that starts with 
"Ôªø". Either case will be wrongly detected as UTF-8. That's the risk 
you take when using a heuristic.

And if you don't want to use that heuristic, then you must specify the 
actual encoding in use.
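The Latin1 false positive is easy to reproduce:

```python
# The bytes EF BB BF are the UTF-8 BOM, but they are also the perfectly
# legal Latin-1 text 'ï»¿'.  A BOM sniffer will call this file UTF-8.
data = '\u00ef\u00bb\u00bfprice'.encode('latin-1')
assert data.startswith(b'\xef\xbb\xbf')        # looks like a UTF-8 BOM
assert data.decode('utf_8_sig') == 'price'     # "BOM" stripped -- wrong!
assert data.decode('latin-1') == '\u00ef\u00bb\u00bfprice'  # intended text
```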


-- 
Steven D'Aprano

