[Python-ideas] Python 3 open() text files: make encoding parameter optional for cross-platform scripts

anatoly techtonik techtonik at gmail.com
Sun Jun 9 12:00:51 CEST 2013


On Sun, Jun 9, 2013 at 1:30 AM, Victor Stinner <victor.stinner at gmail.com> wrote:

> Changing the default encoding of open() was already discussed 2 years ago.
> See this discussion:
> http://mail.python.org/pipermail/python-dev/2011-June/112086.html
>
> I did a long analysis of the Python standard library and I tried a
> modified Python with a default encoding set to utf-8.
>
> The conclusion is that the locale encoding is the least worst choice. The
> main reason is the compatibility with all other applications on the same
> computer. Using a different encoding than the locale encoding leads quickly
> to mojibake issues and other bugs.
>
Any fixed default encoding means deterministic behavior for an open() call
given the same input data.
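
A minimal sketch of that point, with the locale difference simulated by
explicit codec names. The codecs cp1251, cp1253 and cp1252 stand in for
Russian, Greek and Western European Windows, and the helper read_as() is a
hypothetical name used only for illustration:

```python
# Bytes that a script would write on a system whose locale encoding is cp1251.
text = "Привет"
data = text.encode("cp1251")


def read_as(raw, encoding):
    """Decode raw bytes the way a text-mode read would under `encoding`."""
    return raw.decode(encoding, errors="replace")


# Locale-dependent default: the same bytes produce different text per system.
as_russian = read_as(data, "cp1251")   # matches the original text
as_greek = read_as(data, "cp1253")     # silently becomes Greek letters
as_western = read_as(data, "cp1252")   # silently becomes accented Latin letters

# One fixed encoding on both ends: the result no longer depends on the system.
round_trip = text.encode("utf-8").decode("utf-8")

print(as_russian == text, as_greek == text, as_western == text, round_trip == text)
```

Only the reader that happens to share the writer's encoding gets the text
back; the others get mojibake without any error being raised.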

For a cross-platform language, you as a programmer are responsible for
detecting the particular features of the operating system.

> Just one example: the configure script generates a Makefile using the locale
> encoding, and Python gets data from the Makefile. If you use a path with a
> non-ASCII character and use UTF-8 in Python whereas the locale is ISO-8859-1,
> Python cannot be compiled anymore or will refuse to start.
>
I am not a C developer, but as an SCons committer I don't know of any Python
tools that work directly with Makefiles. To me, the example of a Makefile
generated by configure is outside the Python domain, so real examples are
still welcome.

> Remember the Zen of Python: explicit is better than implicit. So set the
> encoding parameter in your code.
>
=)

And because of that Zen, the prototype of open() is:
open(..., encoding=None)
  instead of
open(..., encoding='utf-8')
  or
open(..., encoding=sys.encoding)
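
For reference, sys.encoding as written above is not an existing attribute.
As far as the Python 3 documentation of this period describes it,
encoding=None falls back to locale.getpreferredencoding(False). The sketch
below only inspects that fallback on the current machine; the helper name
default_text_encoding() is hypothetical, and later Python releases or a
UTF-8 mode may report something else:

```python
import codecs
import locale
import os
import tempfile


def default_text_encoding():
    """Return the locale's preferred encoding, the documented fallback for encoding=None."""
    return locale.getpreferredencoding(False)


# Open a scratch file without an encoding argument and ask the file object
# which encoding it actually picked.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w") as f:
        actual = f.encoding
finally:
    os.remove(path)

# The two values name the encoding chosen on this system; the output differs
# between machines, which is the behavior under discussion.
print(default_text_encoding(), actual)
```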

This choice also breaks the key Unix principle of doing one thing well,
because it is not the responsibility of the open() call to determine the
system encoding. Maybe sys.open() would be better for that? The subprocess
precedent of overly complicated cross-platform logic is not surviving open
source development, and it's a pity that it didn't serve as a lesson for
language design decisions.

> When I made the encoding mandatory in my test, more than 70% of calls to
> open() used encoding="locale". So it's simpler to keep the current default
> choice.
>
How many systems have you covered? Something makes me think that you had
deterministic behavior in all your cases because you ran them on a single
system. Most packages distributed from PyPI are designed to be
cross-platform, and most of them use persistence schemes that are either
pickled (speed) or system independent (portability).
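
A sketch of what a system-independent persistence scheme could look like,
assuming JSON as the format and UTF-8 as the agreed encoding. Both are
illustrative choices, and save_settings() and load_settings() are
hypothetical helper names:

```python
import json
import os
import tempfile


def save_settings(path, settings):
    """Write settings as JSON, always encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False)


def load_settings(path):
    """Read settings back with the same fixed encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
try:
    original = {"greeting": "Привет"}
    save_settings(path, original)
    restored = load_settings(path)
    # Read the raw bytes to confirm what was stored on disk.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(restored == original)
```

Because both ends name the encoding, the file reads back the same way
regardless of the locale of the machine that wrote it.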

> The documentation can maybe be improved?
>
I doubt that it can be improved - simple Python functions are already
complicated enough. I wish there was a reverse process of simplifying
things back.

> Victor
>
> On 8 June 2013 at 15:14, "anatoly techtonik" <techtonik at gmail.com> wrote:
>
>> Without reading the subject of this letter, which encoding do you think
>> Python 3 uses for open() calls on a text file? Please write it in your
>> reply and then scroll down.
>>
>>
>> Without cheating, my guess was cp1252 (latin-1), because that was how
>> Python 2 assumed all text files were encoded. Or does Python 2 use ISO-8859-1?
>>
>> But it turned out to be different -
>> http://docs.python.org/3/library/functions.html#open. No, it turned up
>> here - https://bitbucket.org/techtonik/hexdump/pull-request/1/ and after
>> a small lecture I realized how bad things are.
>>
>> open() in Python uses the system encoding to read files by default. So, if
>> a Python script writes a text file with some Cyrillic characters on my
>> Russian Windows, another Python script on an English or Greek Windows will
>> not be able to read it. This is just what happened.
>>
>> The solution proposed is to specify the encoding explicitly. That means I
>> have to know it. Luckily, in this case the text file is my .py file, where
>> I knew the encoding beforehand. In the real world you can never know the
>> encoding beforehand.
>>
>> So, what should Python do if it doesn't know the encoding of the text file
>> it opens:
>> 1. Assume that the encoding of the text file is the encoding of your
>> operating system
>> 2. Assume that the encoding of the text file is ASCII
>> 3. Assume that the encoding of the text file is UTF-8
>>
>> Please write in reply and then scroll down.
>>
>>
>> I propose three, because ASCII is a binary-compatible subset of UTF-8.
>> Choice one is the current behaviour, and it is very bad. Troubleshooting
>> this issue, which should be very common, requires a lot of prior knowledge
>> about encodings and awareness of different system defaults. For
>> cross-platform work with text files this fact implicitly requires you to
>> always use the 'encoding' parameter for open().
>>
>>
>> Is this enough for a PEP? This stuff is rather critical IMO, so even if it
>> is rejected there will be a documented design decision.
>> --
>> anatoly t.