[Python-3000] locale-aware strings ?

Wed Sep 6 02:32:28 CEST 2006

Guido van Rossum wrote:
> On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> 
>> Beyond all of that: It just seems wrong to me that I could send someone a
>> bunch of files and a Python program and their results processing them
>> would be different from mine, despite the fact that we run the same version of
>> Python on the same operating system.
> 
> And it seems just as wrong if Python doesn't do what the user expects.
> If I were a beginning Python user, I'd hate it if I had prepared a
> simple data file in vi or notepad and my Python program wouldn't read
> it right because Python's idea of encoding differs from my editor's.

I don't know about vi, but Notepad will open and save files that are not in
the system ("ANSI") encoding just fine. On opening, it checks for a BOM and
auto-detects UTF-8 and UTF-16; on saving, it will write a BOM if you choose
"Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the
Encoding drop-down box.

This is exactly the behaviour that most users would expect of a well-behaved
Unicode-aware app. It should be as easy as possible to match this behaviour
in a Python program.
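
As a rough illustration of what matching that behaviour could look like, here
is a minimal sketch in present-day Python. The function names and the
"cp1252" fallback are assumptions made for the example, not an existing or
proposed API, and only the BOM check is covered (no content-based
auto-detection, and UTF-32 BOMs are ignored):

```python
import codecs

# BOM -> codec name. Both codecs consume the BOM while decoding.
_BOM_TO_CODEC = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Encodings offered on save, and the BOM written in front of each.
_BOM_FOR_SAVE = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}


def sniff_encoding(data, fallback):
    """Return a codec name based on a leading BOM, else the fallback."""
    for bom, codec in _BOM_TO_CODEC:
        if data.startswith(bom):
            return codec
    return fallback


def decode_with_bom_check(data, fallback="cp1252"):
    """Decode bytes, honouring a BOM if present (fallback is an assumption)."""
    return data.decode(sniff_encoding(data, fallback))


def encode_with_bom(text, encoding):
    """Encode text and prepend the matching BOM, as on saving."""
    return _BOM_FOR_SAVE[encoding] + text.encode(encoding)
```

With these helpers, text saved as UTF-16BE with a BOM and text saved in the
fallback charset without one both decode to the same string, with no change
to the locale.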

> Sorry Paul, I appreciate your standards-driven perspective, but in
> this area I'd rather build in more flexibility than strictly needed,
> than too little. If it turns out that on a particular platform all
> files are in UTF-8, making Python *on that platform* always choose
> UTF-8 is simple enough.

The problem is not the systems where all files are UTF-8, or where all files
are in some other known charset. The problem is the platforms where half of
the files are UTF-8 and half are in some other charset, with the encoding of
each file determined either by its type or by the presence of a UTF-8 BOM.
This is a *very* common situation, especially for European users.
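
To make the mixed case concrete, a per-file choice along those lines might be
sketched as follows. The set of "always UTF-8" file types, the legacy charset
and the function names are all hypothetical, chosen only for the example:

```python
import codecs
import os

# Hypothetical policy: file types assumed to be UTF-8 regardless of locale.
UTF8_TYPES = {".xml", ".json"}

# Assumed "old" system charset still used by non-Unicode-aware applications.
LEGACY_CHARSET = "iso-8859-1"


def pick_encoding(filename, head):
    """Choose a codec by file type, else by a UTF-8 BOM, else the legacy one.

    `head` is the first few bytes of the file.
    """
    ext = os.path.splitext(filename)[1].lower()
    if ext in UTF8_TYPES or head.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" strips a BOM if present and accepts files without one.
        return "utf-8-sig"
    return LEGACY_CHARSET


def read_text(path):
    """Read a text file using the per-file encoding chosen above."""
    with open(path, "rb") as f:
        head = f.read(len(codecs.BOM_UTF8))
    with open(path, encoding=pick_encoding(path, head)) as f:
        return f.read()
```

Note that nothing here consults the locale: the decision is made per file,
which is what a single global default cannot express.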

Such a user cannot set the locale to UTF-8, because that will break all of
their non-Unicode-aware applications. The Unicode-aware applications typically
have much better support for reading and writing files in charsets that are
not the system default. So in practice the locale has to remain set to the
"old" charset for the duration of a migration to UTF-8.

(Setting different locales for different applications is far too much hassle.
On Windows, although I believe it is technically possible to do the equivalent
of selecting a UTF-8 locale, most users don't know how to do it, even if they
want to use UTF-8 exclusively.)

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>