[Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3?

Terry Reedy tjreedy at udel.edu
Tue Jun 28 16:24:58 CEST 2011


On 6/28/2011 9:43 AM, Victor Stinner wrote:
> In Python 2, open() opens the file in binary mode (e.g. file.readline()
> returns a byte string). codecs.open() opens the file in binary mode by
> default; you have to specify an encoding name to open it in text mode.
>
> In Python 3, open() opens the file in text mode by default. (It only
> opens the file in binary mode if the file mode contains "b".) The
> problem is that open() uses the locale encoding if the encoding is not
> specified, which is the case *by default*. The locale encoding can be:
>
>   - UTF-8 on Mac OS X, most Linux distributions
>   - ISO-8859-1 on some FreeBSD systems
>   - ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in
> Western Europe, cp932 in Japan, ...
>   - ASCII if the locale is manually set to an empty string or to "C", or
> if the environment is empty, or by default on some systems
>   - something different depending on the system and user configuration...
>
> If you develop under Mac OS X or Linux, you may be surprised when you
> run your program on Windows and it hits the first non-ASCII character.
> You may not detect the problem if you only write text in English...
> until someone writes the first letter with a diacritic.
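
To make the failure mode above concrete, here is a minimal sketch (not part of the original message). It encodes one accented word under three of the encodings listed above and shows what a reader using the wrong one sees. The sample word is an arbitrary choice, and locale.getpreferredencoding(False) is used on the assumption that it reflects what open() falls back to on the Python versions discussed in this thread.

```python
import locale

# Sample text containing one letter with a diacritic ("café").
text = "caf\u00e9"

# The same text produces different bytes under different encodings.
utf8_bytes = text.encode("utf-8")      # what a UTF-8 locale would write
cp1252_bytes = text.encode("cp1252")   # what a Western European Windows box would write

# Reading the UTF-8 bytes back as cp1252 raises no error; it silently
# yields mojibake instead of the original text.
misread = utf8_bytes.decode("cp1252")

# With an ASCII locale (e.g. LANG="C") the character cannot be encoded at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The encoding this machine's locale reports; it varies from system to system.
default_encoding = locale.getpreferredencoding(False)

print(default_encoding, utf8_bytes, cp1252_bytes, repr(misread), ascii_ok)
```

Pure-ASCII text encodes identically in all three cases, which is why the problem stays hidden until the first non-ASCII character appears.
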
>
>
>
> As discussed before on this list, I propose to set the default encoding
> of open() to UTF-8 in Python 3.3, and add a warning in Python 3.2 if
> open() is called without an explicit encoding and if the locale encoding
> is not UTF-8. Using the warning, you will quickly notice the potential
> problem (using Python 3.2.2 and -Werror) on Windows or by using a
> different locale encoding (e.g. using LANG="C").
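
The proposed warning could be prototyped in pure Python along the following lines. This is only a sketch under stated assumptions: open_checked, needs_warning, is_utf8 and the LocaleEncodingWarning category are hypothetical names invented for illustration, and the proposal itself does not say which warning category would be used or how the check would be implemented inside open().

```python
import codecs
import locale
import warnings


class LocaleEncodingWarning(UnicodeWarning):
    """Hypothetical category; the proposal does not name one."""


def is_utf8(name):
    """Return True if the codec name is an alias of UTF-8."""
    try:
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        # Unknown codec names are treated as "not UTF-8".
        return False


def needs_warning(mode, encoding, locale_encoding):
    """Text mode, no explicit encoding, and a non-UTF-8 locale encoding."""
    return "b" not in mode and encoding is None and not is_utf8(locale_encoding)


def open_checked(file, mode="r", encoding=None, **kwargs):
    """Hypothetical wrapper around open() that emits the proposed warning."""
    locale_encoding = locale.getpreferredencoding(False)
    if needs_warning(mode, encoding, locale_encoding):
        warnings.warn(
            "open() called without an explicit encoding; "
            "locale encoding is %r, not UTF-8" % locale_encoding,
            LocaleEncodingWarning,
            stacklevel=2,
        )
    return open(file, mode, encoding=encoding, **kwargs)
```

Running a program that uses such a wrapper with "python -W error" turns the warning into an exception, which matches the -Werror workflow described above.
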
>
> I expect a lot of warnings from the Python standard library, and as many
> in third party modules and applications. So do you think that it is too
> late to change that in Python 3.3? One argument for changing it directly
> in Python 3.3 is that most users will not notice the change because
> their locale encoding is already UTF-8.
>
> An alternative is to:
>   - Python 3.2: use the locale encoding but emit a warning if the locale
> encoding is not UTF-8
>   - Python 3.3: use UTF-8 and emit a warning if the locale encoding is
> not UTF-8... or maybe always emit a warning?
>   - Python 3.4: use UTF-8 (but don't emit warnings anymore)
>
> I don't think that Windows developers even know that they are writing
> files into the ANSI code page. The MSDN documentation of
> WideCharToMultiByte() warns developers that the ANSI code page is not
> portable, even across Windows computers:
>
> "The ANSI code pages can be different on different computers, or can be
> changed for a single computer, leading to data corruption. For the most
> consistent results, applications should use Unicode, such as UTF-8 or
> UTF-16, instead of a specific code page, unless legacy standards or data
> formats prevent the use of Unicode. If using Unicode is not possible,
> applications should tag the data stream with the appropriate encoding
> name when protocols allow it. HTML and XML files allow tagging, but text
> files do not."
>
> It will always be possible to use the ANSI code page with
> encoding="mbcs" (only works on Windows), or an explicit code page number
> (e.g. encoding="cp1252").
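
As an illustration of the explicit form, the sketch below writes the same text twice through open(), once with encoding="utf-8" and once with an explicit code page, and compares the raw bytes. The file name and sample text are arbitrary; cp1252 is used rather than "mbcs" so that the example also runs outside Windows.

```python
import os
import tempfile

# Sample text with two accented letters ("déjà vu").
text = "d\u00e9j\u00e0 vu"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # arbitrary file name

    # Explicit UTF-8: the bytes on disk no longer depend on the locale.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "rb") as f:
        utf8_raw = f.read()

    # Explicit code page: still available to programs that need it.
    # On Windows, encoding="mbcs" would select the current ANSI code page.
    with open(path, "w", encoding="cp1252") as f:
        f.write(text)
    with open(path, "rb") as f:
        cp1252_raw = f.read()

    # Reading with the same explicit encoding round-trips the text.
    with open(path, "r", encoding="cp1252") as f:
        roundtrip = f.read()

print(utf8_raw, cp1252_raw, roundtrip == text)
```
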
>
> --
>
> The two other (rejected?) options to improve open() are:
>
> - raise an error if the encoding argument is not set: will break most
> programs
> - emit a warning if the encoding argument is not set
>
> --
>
> Should I convert this email into a PEP, or is it not required?

I think a PEP is needed.

-- 
Terry Jan Reedy


