[Python-ideas] Python 3 open() text files: make encoding parameter optional for cross-platform scripts

Victor Stinner victor.stinner at gmail.com
Sun Jun 9 00:30:25 CEST 2013


Changing the default encoding of open() was already discussed 2 years ago.
See this discussion:
http://mail.python.org/pipermail/python-dev/2011-June/112086.html

I did a long analysis of the Python standard library and I tried a modified
Python with a default encoding set to utf-8.

The conclusion is that the locale encoding is the least bad choice. The
main reason is compatibility with all other applications on the same
computer. Using an encoding other than the locale encoding quickly leads
to mojibake and other bugs.
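
To make the mojibake point concrete, here is a minimal sketch. The sample
text and the two encodings are assumptions picked for the demonstration;
they are not taken from the analysis itself.

```python
# Illustrative only: the sample text and both encodings are assumptions.
text = "caf\u00e9"  # "café"

# One program writes the text as UTF-8 bytes...
data = text.encode("utf-8")

# ...and another program on the same machine decodes those bytes with
# the locale encoding, assumed here to be ISO-8859-1.
garbled = data.decode("iso-8859-1")

# The single character "é" comes back as two unrelated characters.
print(repr(text), "->", repr(garbled))
```

No exception is raised here, which is part of the problem: the data is
silently wrong instead of failing loudly.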

Just one example: the configure script generates a Makefile using the locale
encoding, and Python reads data from that Makefile. If the build path
contains a non-ASCII character and Python uses UTF-8 while the locale
encoding is ISO-8859-1, Python can no longer be compiled, or it will refuse
to start.

Remember the Zen of Python: explicit is better than implicit. So set the
encoding parameter in your code.
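
A minimal sketch of what that looks like in practice. The helper names
write_text and read_text, the file name and the choice of UTF-8 are
assumptions for illustration only; use whatever encoding your file format
actually defines.

```python
import os
import tempfile


def write_text(path, text):
    # Hypothetical helper: the encoding is stated explicitly, so the
    # result does not depend on the machine's locale.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text(path):
    # Hypothetical helper: read back with the same explicit encoding.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Temporary location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
write_text(path, "Привет")
roundtrip = read_text(path)
print(roundtrip == "Привет")
```

With the encoding spelled out on both sides, the file reads back the same
way on a Russian, English or Greek system.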

When I made the encoding mandatory in my test, more than 70% of calls to
open() used encoding="locale". So it's simpler to keep the current default
choice.
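
To see which encoding that default resolves to on a given machine, a small
sketch like the following may help. It assumes the default corresponds to
locale.getpreferredencoding(False), as the open() documentation describes;
the function name locale_encoding is hypothetical.

```python
import codecs
import locale


def locale_encoding():
    # Hypothetical helper: return the encoding that open() is documented
    # to fall back on when no encoding argument is given.
    return locale.getpreferredencoding(False)


name = locale_encoding()

# Look the name up in the codec registry to get its normalized form.
codec = codecs.lookup(name)
print(name, "->", codec.name)
```

The printed value differs from one system to another, which is exactly why
a file written with the default on one machine may not be readable with
the default on another.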

Maybe the documentation can be improved?

Victor
On 8 June 2013 at 15:14, "anatoly techtonik" <techtonik at gmail.com> wrote:

> Without reading the subject of this message, which encoding do you think
> Python 3 uses for open() calls on a text file? Please write it in your
> reply and then scroll down.
>
>
> Without cheating, my guess was cp1252 (latin-1), because that is what
> Python 2 assumed for all text files. Or does Python 2 use ISO-8859-1?
>
> But it turned out to be different:
> http://docs.python.org/3/library/functions.html#open. Actually, it came up
> here: https://bitbucket.org/techtonik/hexdump/pull-request/1/ and after
> a short lecture I realized how bad things are.
>
> open() in Python uses the system encoding to read files by default. So if
> a Python script writes a text file with some Cyrillic characters on my
> Russian Windows, another Python script on English Windows or Greek Windows
> will not be able to read it. This is just what happened.
>
> The proposed solution is to specify the encoding explicitly. That means I
> have to know it. Luckily, in this case the text file is my .py file, where
> I knew the encoding beforehand. In the real world you can never know the
> encoding beforehand.
>
> So, what should Python do if it doesn't know the encoding of the text file
> it opens:
> 1. Assume that the encoding of the text file is the encoding of your
> operating system
> 2. Assume that the encoding of the text file is ASCII
> 3. Assume that the encoding of the text file is UTF-8
>
> Please write in reply and then scroll down.
>
>
> I propose option three, because ASCII is a binary-compatible subset of
> UTF-8. Option one is the current behaviour, and it is very bad.
> Troubleshooting this issue, which should be very common, requires a lot of
> prior knowledge about encodings and awareness of different system defaults.
> For cross-platform work with text files, this implicitly requires you to
> always use the 'encoding' parameter for open().
>
>
> Is this enough for a PEP? This stuff is rather critical IMO, so even if it
> is rejected there will be a documented design decision.
> --
> anatoly t.