[Python-3000] locale-aware strings ?

Tue Sep 5 22:48:27 CEST 2006

I have no desire to continue this discussion in every detail. I
believe we've both made our points eloquently enough. The designers of
the I/O library will have to come up with the specific rules for
deciding on the default encoding. The only thing I'm saying is that
hardcoding the default encoding in the language standard (like we did
for str<-->unicode in 2.0) would be a mistake. I'm trusting that
building the more basic facilities (such as being able to pass an
explicit encoding to open()) first will enable us to experiment with
different ways of determining a default encoding. That makes more
sense to me than trying to settle this argument by raising our voices.
(And yes, I am building in the possibility that I'm wrong. But
he-said-she-said won't convince me; only actual usage experience.)
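
To make "passing an explicit encoding to open()" concrete, here is a minimal sketch. It assumes the open(..., encoding=...) signature that Python 3 eventually shipped with. The helper names read_text and write_sample are hypothetical, and falling back to the locale's preferred encoding is only one candidate default policy, not a settled rule.

```python
import locale
import os
import tempfile


def read_text(path, encoding=None):
    """Read a text file, honouring an explicit encoding when one is given.

    The fallback to the locale's preferred encoding is an assumption of
    this sketch; it is exactly the policy that is open to experiment.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    with open(path, "r", encoding=encoding) as f:
        return f.read()


def write_sample(text, encoding):
    """Write text to a temporary file in the given encoding; return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding=encoding) as f:
        f.write(text)
    return path


if __name__ == "__main__":
    sample = write_sample("voil\u00e0", "utf-8")
    try:
        # The caller knows the encoding, so the result does not depend on
        # the machine's locale settings.
        print(read_text(sample, encoding="utf-8"))
    finally:
        os.remove(sample)
```

With the explicit argument the result is the same on every machine. Only the encoding=None path depends on local configuration.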

--Guido

On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> On 9/5/06, Guido van Rossum <guido at python.org> wrote:
>
> > On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> > > Beyond all of that: It just seems wrong to me that I could send someone a
> > > bunch of files and a Python program and their results processing them would
> > > be different from mine, despite the fact that we run the same version of
> > > Python on the same operating system.
> >
> > And it seems just as wrong if Python doesn't do what the user expects.
> > If I were a beginning Python user, I'd hate it if I had prepared a
> > simple data file in vi or notepad and my Python program wouldn't read
> > it right because Python's idea of encoding differs from my editor's.
>
>
> My point is that most textual content in the world is NOT produced in vi or
> notepad or other applications that read the system encoding. Most content is
> produced in Word (future Word files will be zipped Unicode, not opaque
> binary), OpenOffice, Dreamweaver, web services, Gmail, Thunderbird, phpBB,
> etc.
>
> I haven't created locale-relevant content in a generic text editor in a
> very, very long time.
>
> Applications like vi and emacs that "help" you to create content that other
> people can't consume are not really helping at all. After all, we (now!)
> live in a networked era and people don't just create documents and then
> print them out on their local printers. Most of the time when I use text
> editors I am editing HTML, XML or Python and using the default of CP437 is
> wrong for all of those.
>
> Even Python will puke if you take a naive approach to text encodings in
> creating a Python program.
>
> sys:1: DeprecationWarning: Non-ASCII character '\xe0' in file
> c:\temp\testencoding.py on line 1, but no encoding declared; see
> http://www.python.org/peps/pep-0263.html for details
>
> Are you going to change the Python interpreter so that it will "just work"
> with content created in vi and notepad? Otherwise you're saying that Python
> will take a modern collaboration-oriented approach to text processing but
> encourage Python programmers to take a naive, obsolete approach.
>
> It also isn't just a question of flexibility. I think that Brian Quinlan
> made the good point that most English Windows users do not know what
> encoding their computer is using. If this represents 25% of the world's
> Python users, and these users run into UTF-8 data more often than CP437, then
> Python will guess wrong more often than it will guess right for 25% of its
> users. This is really dangerous because CP437 will happily read and munge
> UTF-8 (or even UCS-2 or binary) data. This makes CP437 a terrible default
> for that 25%.
>
> But it's worse than even that. GUI applications on Windows use a different
> encoding than command line ones. So on the same box, Python-in-Tk and
> Python-on-command line will answer that the system encoding is "cp437"
> versus "cp1252". I just tested it.
>
> http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx
>
> Were it not for these issues I would say that it "isn't a big deal" because
> modern Linux distributions are moving to UTF-8 default anyhow, and the Mac
> seems to use ASCII. So we're moving to international standards regardless.
> But default encoding on Windows is totally broken.
>
> The Mac is not totally consistent either. The console decodes UTF-8 for
> display. TextEdit and vim munge the display in different ways (the same GUI
> versus command-line issue again, I guess).
>
> A question: what happens when Python is reading data from a socket or other
> file-like object? Will that data also be decoded as if it came from the
> user's locale?
>
> I don't think that this discussion really has anything to do with being
> compatible with "most of the files on a computer". It is about being
> compatible with a certain set of Unix text processing applications.
>
>  Paul Prescod
>
>
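
The quoted claim that CP437 "will happily read and munge" UTF-8 data can be checked with a short sketch. It assumes Python 3's bytes.decode and the standard "cp437", "cp1252" and "utf-8" codecs. The helper decode_or_none is hypothetical, and the sample word is chosen only because its last letter is the '\xe0' byte from the warning quoted above.

```python
def decode_or_none(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


text = "voil\u00e0"                    # ends with U+00E0, a with grave accent

utf8_bytes = text.encode("utf-8")      # b'voil\xc3\xa0'
legacy_bytes = text.encode("cp1252")   # b'voil\xe0'

# CP437 assigns a character to every byte value, so decoding never fails:
# UTF-8 input is silently turned into the wrong characters.
munged = decode_or_none(utf8_bytes, "cp437")

# UTF-8 is self-checking: a lone 0xE0 byte is not a valid sequence, so the
# mismatch is reported instead of being hidden.
rejected = decode_or_none(legacy_bytes, "utf-8")

# Even arbitrary binary data decodes "successfully" as CP437.
binary_accepted = decode_or_none(bytes(range(256)), "cp437") is not None

if __name__ == "__main__":
    print(repr(munged), rejected, binary_accepted)
```

The asymmetry is the point: a wrong guess of UTF-8 tends to fail loudly, while a wrong guess of CP437 produces plausible-looking garbage.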

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)