[Python-ideas] Normalized Python

Chris Angelico rosuav at gmail.com
Wed Jan 29 15:33:56 CET 2014


On Wed, Jan 29, 2014 at 8:11 PM, anatoly techtonik <techtonik at gmail.com> wrote:
> Normalized Python - a set of default, standard behaviors that back up
> common user expectations about cross-platform and system-independent
> behavior regardless of backward compatibility and code compatibility
> concerns.
>
> Having a separate "Normalized Python" concept is needed to set
> the context for developing and engineering ideas, instead of
> concentrating on the sad reality of backward compatibility curse.

You can achieve the first two simply by opening files with parameters.
There is NOTHING Windows-specific or Linux-specific in that. As of
Python 3, opening in text mode is the default... but you can override
that so easily. Why change the default (which breaks back compat) when
you can just change your code?
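Concretely (my sketch, not from the original mail), the whole difference is one character in the mode string:

```python
import os
import tempfile

# Write a small file so the example is self-contained.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("line one\n")

# Text mode (the Python 3 default): you get str, decoded for you.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()   # 'line one\n'

# Binary mode: add 'b' to the mode string and you get the raw bytes,
# identical on Windows and Linux.
with open(path, "rb") as f:
    raw = f.read()    # b'line one\n'

os.unlink(path)
```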

And I believe you can reopen stdin/stdout as binary, if you really
want to, but that is a little harder. It's still not going to have any
platform-specific code in it. (As I've never written a filter for
binary files in Python, I've never had the need to read/write standard
streams in binary. But I've no doubt that someone who has can show you
how easy it is - I'd guess it's less than five lines of code, knowing
Python.)
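It really is short: Python 3 already exposes the raw byte streams underneath the text wrappers as sys.stdin.buffer and sys.stdout.buffer. A sketch of a binary pass-through filter (the helper name and its optional arguments are mine, added so it can be demonstrated without a real pipe):

```python
import io
import sys

def binary_cat(instream=None, outstream=None):
    """Copy raw bytes from instream to outstream, with no decoding."""
    # The text wrappers sys.stdin / sys.stdout expose their underlying
    # binary streams as the .buffer attribute.
    instream = instream if instream is not None else sys.stdin.buffer
    outstream = outstream if outstream is not None else sys.stdout.buffer
    outstream.write(instream.read())

# Demo on in-memory streams: bytes pass through untouched, even ones
# that are not valid text in any encoding.
out = io.BytesIO()
binary_cat(io.BytesIO(b"\xff\x00\xfe"), out)
```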

> This is needed, for example, to collect these two features:

(Among our features are such diverse elements as... oh, wrong Pythons.)

> 1. open files in binary mode by default
> why?
>     because "text file" is a human abstraction; for the operating
>     system it is just another format of binary data, so the default
>     operation is to read this data without any preprocessing

A reasonably plausible argument. C++ follows that sort of model (you
shouldn't pay for anything you're not using). SQL mostly follows that
model (it generally takes more keywords to get the database to do more
work - compare "SELECT x FROM y" and "SELECT x FROM y ORDER BY z",
where the latter adds a sort phase; there are exceptions to this, like
UNION ALL vs UNION, but they're notable _because_ they're exceptions).
But it's nothing like a strong enough argument for changing. Creating
two subtly different languages is a major problem, especially when the
exact same syntax means different things. Imagine if I create a fork
of Python that's absolutely identical except that you create a set
with [1,2,3] and a list with {1,2,3}. All your code will be
syntactically correct, but suddenly it does something quite different.
That is a BAD idea. It would have to be *immensely* better to justify
the breakage; and this is only "arguably better". (The most obvious
contrary argument is that the default should do the thing most people
want most often, which is working with text files. This same argument
justifies the use of arbitrary-precision integers by default, instead
of requiring an explicit "long" type; I'm sure you'll agree that the
Py3 unification of these types was an advantage.)

> 2. open text files in utf-8 encoding
> why?
>     because users cannot know the encoding of the operating
>     system, their programs cannot choose the right encoding,
>     therefore the best guess is to expect the most widely used
>     standard

Yes, this one is an issue. Python lets the OS recommend a default
encoding, on the expectation that a Python script should fit into its
host platform, rather than that all platforms should conform to what
Python wants. A judgment call, and I'm sure there can be endless
debates about what Python should do, but since it can be overridden
with a single parameter on the open call, not a big deal IMO.
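To make that concrete (a sketch, assuming a writable temp directory): locale.getpreferredencoding() is what open() consults when you pass no encoding, and one keyword argument overrides it:

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding= is given; this value is
# platform- and configuration-dependent.
platform_default = locale.getpreferredencoding(False)

fd, path = tempfile.mkstemp()
os.close(fd)

# A single parameter pins the encoding regardless of the host platform.
with open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9\n")
with open(path, "r", encoding="utf-8") as f:
    roundtrip = f.read()

os.unlink(path)
```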

> 3. treat stdout/stdin streams as binary
> why?
>     because you don't want your data to be corrupted when
>     you pass it in and out of Python via standard streams

Most definitely NOT. The standard streams should, by default, be text
streams, and should have their encodings set according to what the
other side wants. If there's a way for the OS and Python to
communicate an encoding, that's absolutely perfect. Yes, there'll be a
few edge cases involving redirection, but that's pretty much
unsolvable anyway. The normal usage of Python MUST include Unicode;
and that means the most obvious way to produce output (the print
function) needs to write Unicode. So if stdout is a binary stream,
what's print going to do with a str? Encode it? If so, you just move
the issue - and print can send to multiple streams, so it'd need to
know which are text and which are binary, etc, etc. Or should it throw
an error, and force the programmer to do stuff like this:

CONSOLE_ENCODING = "utf-8" # add some logic for guessing this
s = "Hello, world!"
print(s.encode(CONSOLE_ENCODING))

just to ensure that every programmer has to battle with the encodings
manually, in lots of places, instead of configuring it once (or, more
likely, having the default be right) and then having clean code
everywhere?

The only way that opening stdin/out as binary will prevent the
corruption of your data is if your data is fundamentally bytes. Most
programs, in any language, work with data that's fundamentally text;
granted, a lot of languages don't distinguish, but if you look at what
the programmer's doing, it's still text. Anything that prints "Hello,
world!" is printing text, not bytes, and if the console's encoding is
UTF-16, that should emit 26 bytes (plus any newline that's
appropriate). Forcing the programmer to think about this is completely
unnecessary.
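The byte count is easy to check (assuming the hypothetical console's UTF-16 has a fixed byte order and no BOM):

```python
s = "Hello, world!"

print(len(s))                       # 13 characters of text
print(len(s.encode("utf-16-le")))   # 26 bytes: two per character
print(len(s.encode("utf-8")))       # 13 bytes: ASCII chars take one each
```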

How many times do you actually come across these issues in porting?
How much effort would you really save if these measures were
implemented? If it's that important to you, fork CPython and create
this "Normalized Python" that does everything you want (and then,
linking this with the other thread, continue development of Normalized
Python according to an Agile model and see if people join you rather
than CPython). Good luck.

ChrisA
