[Python-Dev] Internal representation of strings and Micropython

Nick Coghlan ncoghlan at gmail.com
Thu Jun 5 15:15:54 CEST 2014


On 5 June 2014 22:37, Paul Sokolovsky <pmiscml at gmail.com> wrote:
> On Thu, 5 Jun 2014 22:20:04 +1000
> Nick Coghlan <ncoghlan at gmail.com> wrote:
>> problems caused by trusting the locale encoding to be correct, but the
>> startup code will need non-trivial changes for that to happen - the
>> C.UTF-8 locale may even become widespread before we get there).
>
> ... And until those golden times come, it would be nice if Python did
> not force its perfect-world model, which unfortunately is not based on
> the surrounding reality, and instead let users solve their encoding
> problems themselves - when they need to, because, again, one can go
> quite a long way without dealing with encodings at all. Whereas now
> Python 3 forces users to deal with encodings almost universally, while
> also forcing a particular encoding for all strings (which, again,
> doesn't correspond to the state of the surrounding reality). I already
> hear the response that it's good that users are taught to deal with
> encodings, that it will make them write correct programs, but that's a
> bit far away from the original aim of making it easy and pleasant to
> write "correct" programs. (And definitions of "correct" vary.)

As I've said before in other contexts: find me Windows, Mac OS X, and
JVM developers, or educators and scientists, who are as concerned about
the text model changes as folks who are primarily focused on Linux
system (including network) programming, and I'll be more willing to
concede the point.

Windows, Mac OS X, and the JVM are all opinionated about the text
encodings to be used at platform boundaries (using UTF-16, UTF-8 and
UTF-16, respectively). By contrast, Linux (or, more accurately, POSIX)
says "well, it's configurable, but we won't provide a reliable
mechanism for finding out what the encoding is. So either guess as
best you can based on the info the OS *does* provide, assume UTF-8,
assume 'some ASCII compatible encoding', or don't do anything that
requires knowing the encoding of the data being exchanged with the OS,
like, say, displaying file names to users or accepting arbitrary text
as input, transforming it in a content-aware fashion, and echoing it
back in a console application".

None of those options is a perfectly good choice. Six(ish) years ago, we
chose the first option, because it has the best chance of working
properly on Linux systems that use ASCII-incompatible encodings like
Shift-JIS, ISO-2022, and various other East Asian codecs. For normal
user space programming, Linux is pretty reliable when it comes to
ensuring the locale encoding is set to something sensible, but the
price we currently pay for that decision is interoperability issues
with things like daemons that don't receive any locale configuration
and hence fall back to the POSIX locale, or ssh environment forwarding
carrying a client's encoding settings into a session on a server with
different settings. I still consider it preferable to impose
inconveniences like that based on use case (situations where Linux
systems don't provide sensible encoding settings) rather than on
geographic region (locales where ASCII-incompatible encodings are
likely to still be in common use).
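
The daemon case is easy to demonstrate: run a child interpreter with an
empty environment (roughly what an init system provides) and see what
encoding it falls back to. A small sketch, with the exact output
depending on the platform and Python version:

    import subprocess
    import sys

    # With no LANG/LC_* settings at all, the child interpreter falls
    # back to the POSIX locale and hence guesses an ASCII-based
    # encoding, even though the interactive session that launched it
    # may well be using UTF-8.
    output = subprocess.check_output(
        [sys.executable, "-c",
         "import sys, locale; "
         "print(sys.getfilesystemencoding(), "
         "locale.getpreferredencoding(False))"],
        env={},  # roughly what a daemon started by an init system sees
    )
    print(output.decode("ascii"))  # typically: ascii ANSI_X3.4-1968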

If I (or someone else) ever find the time to implement PEP 432 (or
something like it) to address some of the limitations of the
interpreter startup sequence that currently make it difficult to avoid
relying on the POSIX locale encoding on Linux, then we'll be in a
position to reassess that decision based on the increased adoption of
UTF-8 by Linux distributions in recent years. As the major community
Linux distributions complete the migration of their system utilities
to Python 3, we'll get to see whether they decide it's better to make
their locale settings more reliable, or to help make it easier for
Python 3 to ignore them when they're wrong.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

