[Python-ideas] Python 3 open() text files: make encoding parameter optional for cross-platform scripts

anatoly techtonik techtonik at gmail.com
Sun Jun 9 13:09:34 CEST 2013


On Sun, Jun 9, 2013 at 4:10 AM, Andrew Barnert <abarnert at yahoo.com> wrote:

> From: anatoly techtonik <techtonik at gmail.com>
> Sent: Saturday, June 8, 2013 6:13 AM
>
>  >open() in Python uses system encoding to read files by default. So, if
> Python script writes text file with some Cyrillic character on my Russian
> Windows, another Python script on English Windows or Greek Windows will not
> be able to read it. This is just what happened.
>
>
> True. But if I create a text file with "type >foo.txt" or Notepad on a
> Russian system, I won't be able to open it on that English or Greek system
> either… but at least I'll be able to open it on Python on the same Russian
> system. If you changed Python to ignore the locale, that would cause a new
> problem (the latter would no longer be true) without fixing any existing
> problem (the former would still not be true).


I'd say that all popular user software on two-language Windows that opens
plain text files auto-detects the encoding. That is unlike Linux, where such
bugs are not considered important (I am speaking about GEdit in particular),
because people are sure that I will open only ASCII files on my Ubuntu box.


> >The solution proposed is to specify encoding explicitly. That means I
> have to know it. Luckily, in this case the text file is my .py where I knew
> the encoding beforehand. In real world you can never know the encoding
> beforehand.
>
> This is an inherent problem that Python didn't cause, and can't solve. As
> long as there are text files in different encodings out there, you need to
> pass the encoding out-of-band. If you're behind the process that creates
> the files (whether it's a program you wrote, or options you set in
> Notepad's Save As dialog), you can just make sure to use the same encoding
> on every system, and you have no problem. But if you need to deal with
> files that others have created, that won't work.


Right. So one way or the other, you will inevitably face the problem that a
user has installed your Python program to open a file that is in a different
encoding than expected. The major difference is in the procedure that you,
as a software developer, will go through to troubleshoot and cover the issue.
With the implicit system encoding you'll have to deal with the magic of
detecting the user's system encoding, and will have to mock this magic in
your tests. With an explicit 'utf-8' setting, opening the file will fail
right away, and the failure will be system independent: one headache less.
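
A minimal sketch of the "fail right away" case, assuming a file saved in
cp1251 (a typical Russian Windows code page). The file name, the sample text
and the helper read_explicit_utf8() are made up for illustration:

```python
import os
import tempfile

# Hypothetical sample: Cyrillic text saved in cp1251.
text = "Привет"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(text.encode("cp1251"))


def read_explicit_utf8(p):
    """Read with an explicit encoding; the outcome does not depend on locale."""
    with open(p, encoding="utf-8") as f:
        return f.read()


# The cp1251 bytes are not valid UTF-8, so decoding fails on every system.
try:
    read_explicit_utf8(path)
    failed = False
except UnicodeDecodeError:
    failed = True

# Passing the real encoding out-of-band reads the file correctly.
with open(path, encoding="cp1251") as f:
    recovered = f.read()
```

With the encoding left out, the same open() call would depend on the locale
of the machine it runs on, which is the part that has to be mocked in tests.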

> >So, what should Python do if it doesn't know the encoding of text file it
> opens:
> >1. Assume that encoding of text file is the encoding of your operating
> system
> >2. Assume that encoding of text file is ASCII
> >3. Assume that encoding of text file is UTF-8
> >
>
> >Please write in reply and then scroll down.
>
> That order happens to be exactly my preference. #1 helps for one very
> common problem—files created by other programs on the same machine. #2 is
> generally at least safe, in that you'll get an error instead of mojibake.
> #3 doesn't really help anything.
>

So far 5+ vs 1-, and 4 people without personal preference. IIRC, in Python 2
on Windows, open()ing text files with the operating system encoding always
results in an error. As for mojibake, you need to be very explicit in Python
to get it.

> >I propose three, because ASCII is a binary compatible subset of UTF-8.
> Choice one is the current behaviour, and it is very bad. Troubleshooting
> this issue, which should be very common, requires a lot of prior knowledge
> about encodings and awareness of difference system defaults. For
> cross-platform work with text files this fact implicitly requires you to
> always use 'encoding' parameter for open().
>
> I'm not sure what you mean by "cross-platform" here. Most non-Windows
> platforms nowadays set the locale to UTF-8 by default (and if you're using
> an older *nix, or deliberately chose not to use UTF-8 even though it's the
> default, you already know how to deal with these issues). So, it's really a
> Windows problem, if anything.
>

This argument makes me assume that perhaps people who are not on Windows
and are not using a localized OS version do not fully realize the problem,
because their native encoding is already UTF-8. Therefore they receive and
commit files in UTF-8 and don't notice when there is a file created on a
different system with a different encoding, which will cause problems.
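
For anyone who wants to check what their own system does, here is a small
sketch. The printed values depend on the OS, the locale settings and the
Python version, so no particular output is assumed:

```python
import locale
import tempfile

# The locale's preferred encoding, without modifying the current locale.
default = locale.getpreferredencoding(False)

# The encoding a text-mode file actually gets when none is passed.
with tempfile.TemporaryFile("w+") as f:
    used = f.encoding

print("locale preferred encoding:", default)
print("encoding used by text-mode file:", used)
```

On a box where both values are already UTF-8, a script that omits the
encoding parameter will appear to work, and the problem only shows up on a
machine with a different code page.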

> Switching to UTF-8 would make it harder to read and write files created by
> other programs on the same machine


Sorry for breaking the quote, but this argument is not generally valid.
A high percentage of software, by default, explicitly creates plain text
files in UTF-8 encoding regardless of system defaults. I may even assume
that the real percentage of such programs is higher.


> —and it still wouldn't magically make you able to read and write files
> created on other machines, unless you only care about files created on
> recent *nix platforms. The only case it would help is making it easier to
> read and write files created by _your program_ without worrying about the
> local machine. While that isn't _nothing_, I don't think it's so important
> that we can just dismiss dealing with files created by other programs.
>

I cannot agree with your general priorities in the approach to application
design, which are:
1. Program should be able to read 3rd-party files produced on the same
system
2. Program should be able to read its own files on any system

My choice is 2, then 1.


> After all, you're presumably using plain text files, rather than some
> binary format or JSON or YAML or XML or whatever, because you want users to
> be able to view and edit those files, right?
>

Vice versa. For user-editable files I make sure that the format is JSON,
YAML or XML, which can easily be validated against user errors. System
configuration files use the UTF-8-compatible English ASCII set; all other
files are source files that are checked into a version control system and by
definition should be cross-platform compatible. My Python tools work with
platform-independent files, and follow the "In the face of ambiguity,
refuse the temptation to guess." Zen by either detecting or requiring a
specific encoding standard. Their behavior was deterministic until I ported
them to Python 3.
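
A rough sketch of what "detecting or requiring" could look like, not the
actual code of my tools. The helper name read_text_strict(), the file names
and the choice to detect only a UTF-8 BOM are assumptions made for the
example. It reads bytes and decodes them itself, so no newline translation
is done:

```python
import codecs
import os
import tempfile


def read_text_strict(path, encoding="utf-8"):
    """Hypothetical helper: detect a UTF-8 BOM, otherwise require `encoding`."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        # Detected: strip the BOM and decode as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # Required: raises UnicodeDecodeError instead of guessing.
    return data.decode(encoding)


tmp = tempfile.mkdtemp()

# Plain ASCII config file: ASCII bytes are valid UTF-8 as they are.
ascii_path = os.path.join(tmp, "plain.txt")
with open(ascii_path, "wb") as f:
    f.write(b"key = value\n")

# UTF-8 file with a BOM, as some Windows editors write it.
bom_path = os.path.join(tmp, "bom.txt")
with open(bom_path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "ключ".encode("utf-8"))

ascii_text = read_text_strict(ascii_path)
bom_text = read_text_strict(bom_path)
```

The result is the same on any system, because nothing in it consults the
locale.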