[Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3?

Victor Stinner victor.stinner at haypocalc.com
Tue Jun 28 15:43:05 CEST 2011


In Python 2, open() opens the file in binary mode (e.g. file.readline()
returns a byte string). codecs.open() also opens the file in binary mode by
default; you have to specify an encoding name to open it in text mode.

In Python 3, open() opens the file in text mode by default. (It only
opens the file in binary mode if the file mode contains "b".) The problem is
that open() uses the locale encoding if the encoding is not specified,
which is the case *by default*. The locale encoding can be:

 - UTF-8 on Mac OS X and most Linux distributions
 - ISO-8859-1 on some FreeBSD systems
 - ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in
Western Europe, cp932 in Japan, ...
 - ASCII if the locale is manually set to an empty string or to "C", or
if the environment is empty, or by default on some systems
 - something different depending on the system and user configuration...
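
As a quick way to see which case applies on a given machine, here is a
minimal sketch. It assumes that the fallback used by open() is the value
returned by locale.getpreferredencoding(False), which is what the Python 3
documentation of this era describes; the variable names are only
illustrative.

```python
import codecs
import locale

# Encoding that open() is assumed to fall back to when no encoding
# argument is given. Passing False avoids calling setlocale() as a
# side effect.
default_encoding = locale.getpreferredencoding(False)

# Normalize the codec name (e.g. "UTF-8" or "utf8" -> "utf-8") so the
# comparison does not depend on how the platform spells it.
normalized = codecs.lookup(default_encoding).name
is_utf8 = (normalized == "utf-8")

print(default_encoding, is_utf8)
```

Running this with LANG="C" on Linux, or on a Western European Windows
installation, should show a non-UTF-8 name.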

If you develop under Mac OS X or Linux, you may be surprised when your
program hits the first non-ASCII character on Windows. You may not detect
the problem if you only write text in English... until someone writes the
first letter with a diacritic.
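
The effect can be reproduced without touching the locale by encoding and
decoding explicitly. This is only an illustration of the mismatch: the
sample string and the choice of cp1252 and ASCII as the "other" encodings
are assumptions made for the example.

```python
text = "caf\u00e9"  # "café", one letter with a diacritic

# Bytes as they would be written on a system whose locale encoding
# is UTF-8.
data = text.encode("utf-8")

# The same bytes read back on a system whose locale encoding is cp1252:
# no error is raised, but the text is silently corrupted.
misread = data.decode("cp1252")
# misread is now "cafÃ©" instead of "café"

# With an ASCII locale, the write itself fails at the first non-ASCII
# character.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Pure ASCII text gives identical bytes in all three encodings, which is why
the problem stays hidden for so long.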



As discussed before on this list, I propose to set the default encoding
of open() to UTF-8 in Python 3.3, and add a warning in Python 3.2 if
open() is called without an explicit encoding and if the locale encoding
is not UTF-8. Using the warning, you will quickly notice the potential
problem (using Python 3.2.2 and -Werror) on Windows or by using a
different locale encoding (e.g. using LANG="C").
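
The proposed check could look roughly like the following pure-Python
sketch. The wrapper name open_with_warning, the warning category and the
message are all hypothetical; the real change would live inside open()
itself and might differ in every detail.

```python
import codecs
import locale
import warnings


def open_with_warning(path, mode="r", encoding=None, **kwargs):
    """Hypothetical wrapper around open(): warn when a text-mode call
    relies on a locale encoding that is not UTF-8."""
    if "b" not in mode and encoding is None:
        locale_encoding = locale.getpreferredencoding(False)
        if codecs.lookup(locale_encoding).name != "utf-8":
            warnings.warn(
                "open() called without an explicit encoding; the locale "
                "encoding is %r, not UTF-8" % locale_encoding,
                UnicodeWarning,  # category chosen for illustration only
                stacklevel=2,
            )
    return open(path, mode, encoding=encoding, **kwargs)
```

With the interpreter started with -W error, the warning becomes an
exception, so the first unqualified text-mode call fails immediately
instead of corrupting data later. Calls in binary mode or with an explicit
encoding never trigger it.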

I expect a lot of warnings from the Python standard library, and as many
in third-party modules and applications. So do you think that it is too
late to change that in Python 3.3? One argument for changing it directly
in Python 3.3 is that most users will not notice the change because
their locale encoding is already UTF-8.

An alternative is to:
 - Python 3.2: use the locale encoding but emit a warning if the locale
encoding is not UTF-8
 - Python 3.3: use UTF-8 and emit a warning if the locale encoding is
not UTF-8... or maybe always emit a warning?
 - Python 3.4: use UTF-8 (but don't emit warnings anymore)

I don't think that Windows developers even know that they are writing
files in the ANSI code page. The MSDN documentation of
WideCharToMultiByte() warns developers that the ANSI code page is not
portable, even across Windows computers:

"The ANSI code pages can be different on different computers, or can be
changed for a single computer, leading to data corruption. For the most
consistent results, applications should use Unicode, such as UTF-8 or
UTF-16, instead of a specific code page, unless legacy standards or data
formats prevent the use of Unicode. If using Unicode is not possible,
applications should tag the data stream with the appropriate encoding
name when protocols allow it. HTML and XML files allow tagging, but text
files do not."

It will always be possible to use the ANSI code page using
encoding="mbcs" (only works on Windows), or an explicit code page name
(e.g. encoding="cp1252").
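
To illustrate what passing an explicit encoding buys you, here is a small
sketch that writes the same text twice and compares the bytes on disk. The
file names and sample text are made up for the example, and "mbcs" is left
out because it only exists on Windows.

```python
import os
import tempfile

text = "d\u00e9j\u00e0 vu"  # "déjà vu"

with tempfile.TemporaryDirectory() as tmp:
    utf8_path = os.path.join(tmp, "utf8.txt")
    cp1252_path = os.path.join(tmp, "cp1252.txt")

    # An explicit encoding makes the result independent of the locale.
    with open(utf8_path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(cp1252_path, "w", encoding="cp1252") as f:
        f.write(text)

    # Read the raw bytes back to see what was actually stored.
    with open(utf8_path, "rb") as f:
        utf8_bytes = f.read()
    with open(cp1252_path, "rb") as f:
        cp1252_bytes = f.read()
```

The two files hold different bytes for the same text, but each one is the
same on every platform, which is the point of naming the encoding.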

--

The two other (rejected?) options to improve open() are:

- raise an error if the encoding argument is not set: will break most
programs
- emit a warning if the encoding argument is not set

--

Should I convert this email into a PEP, or is it not required?

Victor


