[Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3?

M.-A. Lemburg mal at egenix.com
Wed Jun 29 12:20:42 CEST 2011


Victor Stinner wrote:
> On Wednesday, 29 June 2011 at 10:18 +0200, M.-A. Lemburg wrote:
>> Victor Stinner wrote:
>>> On Tuesday, 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote:
>>>> How about a more radical change: have open() in Py3 default to
>>>> opening the file in binary mode, if no encoding is given (even
>>>> if the mode doesn't include 'b') ?
>>>
>>> I tried your suggested change: Python doesn't start.
>>
>> No surprise there: it's an incompatible change, but one that undoes
>> a wart introduced in the Py3 transition. Guessing encodings should
>> be avoided whenever possible.
> 
> It means that all programs written for Python 3.0, 3.1, 3.2 will stop
> working with the new 3.x version (let's say 3.3). Users will have to
> migrate from Python 2 to Python 3.2, and then migrate from Python 3.2
> to Python 3.3 :-(

I wasn't suggesting doing this for 3.3, but we may want to start
the usual feature change process to make the change eventually
happen.

> I would prefer a ResourceWarning (emitted if the encoding is not
> specified), hidden by default: it doesn't break compatibility, and
> -Werror gives exactly the same behaviour that you expect.

ResourceWarning is the wrong type of warning for this. I'd
suggest using a UnicodeWarning or perhaps creating a new
EncodingWarning instead.
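
A minimal sketch of what such a warning could look like, assuming a
wrapper around open() rather than a change to open() itself. The names
MissingEncodingWarning and open_checked are hypothetical and exist only
for this illustration; the category subclasses UnicodeWarning and is
registered as ignored by default, so it only shows up when warnings are
explicitly enabled (e.g. with -W error).

```python
import warnings


class MissingEncodingWarning(UnicodeWarning):
    """Hypothetical warning category; not part of the stdlib."""


# Hidden by default. append=True puts this filter at the end of the list,
# so filters given on the command line (e.g. -W error) still take priority.
warnings.simplefilter("ignore", MissingEncodingWarning, append=True)


def open_checked(path, mode="r", encoding=None, **kwargs):
    """Hypothetical wrapper: warn when a text file is opened without an encoding."""
    if "b" not in mode and encoding is None:
        warnings.warn(
            "text file opened without an explicit encoding",
            MissingEncodingWarning,
            stacklevel=2,
        )
    return open(path, mode, encoding=encoding, **kwargs)
```

With this sketch, open_checked(path) still falls back to the default
encoding but flags the call site, while open_checked(path,
encoding="utf-8") and open_checked(path, "rb") stay silent.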

>> This demonstrates that Python's stdlib is still not being explicit
>> about the encoding issues. I suppose that things just happen to work
>> because we mostly use ASCII files for configuration and setup.
> 
> I did more tests. I found some mistakes and sometimes the binary mode
> can be used, but most functions really expect the locale encoding (it is
> the correct encoding to read-write files). I agree that it would be good to
> have an explicit encoding="locale", but making it mandatory is a little
> bit rude.

Again: Using a locale based default encoding will not work out
in the long run. We've had those discussions many times in the
past.

I don't think there's anything bad about requiring the user to
set an encoding if they want to read text. It makes them
think twice about the encoding issue, which is good.

And, of course, the stdlib should start using this
explicit-is-better-than-implicit approach as well.
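
As a rough illustration of the explicit style, assuming small helper
functions (write_text, read_text, read_bytes and locale_encoding are
hypothetical names, not stdlib APIs): the encoding is a required
argument for text I/O, bytes are read in binary mode, and code that
really wants the locale encoding asks for it by name.

```python
import locale


def write_text(path, text, encoding):
    # The encoding is a required argument, so the caller has to choose one.
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


def read_text(path, encoding):
    # Decoding uses the encoding the caller names; nothing is guessed.
    with open(path, "r", encoding=encoding) as f:
        return f.read()


def read_bytes(path):
    # No decoding at all: binary mode returns the raw bytes.
    with open(path, "rb") as f:
        return f.read()


def locale_encoding():
    # Spell out the locale-based choice instead of relying on it silently.
    return locale.getpreferredencoding(False)
```

A caller that wants the locale behaviour would then write
read_text(path, locale_encoding()), which makes the dependency on the
user's environment visible at the call site.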

>>> Then I tried my suggestion (use "utf-8" by default): Python starts
>>> correctly, I can build it (run "make") and... the full test suite passes
>>> without any change. (I'm testing on Linux, my locale encoding is UTF-8.)
>>
>> I bet it would also pass with "ascii" in most cases. Which then just
>> means that the Python build process and test suite are not a good
>> test case for choosing a default encoding.
>>
>> Linux is also a poor test candidate for this, since most user setups
>> will use UTF-8 as locale encoding. Windows, OTOH, uses all sorts of
>> code page encodings (usually not UTF-8), so you are likely to hit
>> the real problem cases a lot easier.
> 
> I also ran the test suite on my patched Python (open uses UTF-8 by
> default) with ASCII locale encoding (LANG=C), the test suite also
> passes. Many tests use non-ASCII characters, some of them are skipped if
> the locale encoding is unable to encode the tested text.

Thanks for checking. So the build process and test suite are
indeed not suitable test cases for the problem at hand. With
just ASCII files to decode, Python will simply never fail
to decode the content, regardless of whether you use an ASCII,
UTF-8 or some Windows code page as locale encoding.
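
A small sketch of that point, using made-up sample data (the byte
strings and the choice of cp1252 as the Windows code page are
assumptions for illustration): pure ASCII bytes decode to the same text
under all three encodings, while UTF-8 encoded non-ASCII bytes either
fail or come out as different text.

```python
ENCODINGS = ["ascii", "utf-8", "cp1252"]

# Hypothetical file contents: one pure ASCII, one with a non-ASCII character.
ascii_data = "encoding = default".encode("ascii")
non_ascii_data = "caf\u00e9".encode("utf-8")


def decode_all(data):
    """Decode data with each encoding; None marks a UnicodeDecodeError."""
    results = {}
    for name in ENCODINGS:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


ascii_results = decode_all(ascii_data)
non_ascii_results = decode_all(non_ascii_data)

# ASCII input: every encoding yields the same string, so nothing can fail.
print(ascii_results)
# Non-ASCII input: "ascii" fails, and "cp1252" silently yields different text.
print(non_ascii_results)
```

A test suite that only ever decodes data like ascii_data cannot tell
these encodings apart, which is why it says nothing about the right
default.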

-- 
Marc-Andre Lemburg
eGenix.com
