[Python-3000] locale-aware strings ?

Wed Sep 6 12:51:55 CEST 2006

"Paul Prescod" <paul at prescod.net> writes:

> Windows users do not "tell each program separately about the
> encoding." The encoding varies by file type.

There are lots of Unix file types which are plain text files whose
encoding is not specified explicitly.

> It makes no more sense to have a global variable that says "all of
> my files are Shift-JIS" than it does to say "all of my files are
> PowerPoint files."

Not all: it's just the default for text files.

> This is how real-world programs work. They shouldn't guess based on
> system global variables.

But they do. It's a fact which is impossible to change by decree.
There is no place, other than the locale, which would suggest which
encoding is used in /etc files, or in the contents of environment
variables, or on the terminal. You might say that it's unfortunate,
but it's true.
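
To make this concrete, here is a minimal Python sketch of a program
asking the locale for that hint and using it only as a default. The
helper names default_text_encoding and read_text_file are made up for
illustration; locale.getpreferredencoding is the standard library call.

```python
import locale


def default_text_encoding():
    """Return the encoding the locale suggests for text (a hint, not a guarantee)."""
    # Passing False asks for the current settings without calling setlocale().
    return locale.getpreferredencoding(False)


def read_text_file(path, encoding=None):
    """Read a text file, falling back to the locale encoding when none is given.

    Hypothetical helper: an explicit encoding always wins over the locale default.
    """
    with open(path, "r", encoding=encoding or default_text_encoding()) as f:
        return f.read()
```

A caller who knows better passes the encoding explicitly, e.g.
read_text_file(path, encoding="iso-8859-2"); everyone else gets
whatever the locale says.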

At most you could advocate specifying new file formats with the
encoding in mind, like XML does. This doesn't enrich existing file
formats with that information.

Of course technically these formats are just sequences of bytes,
and most programs pass non-ASCII fragments around without looking
into them deeper. But as soon as one tries to treat them as natural
language text, search them case-insensitively, or embed text taken
from them in HTML files, the encoding begins to matter. There is
a general shift among programming languages towards translating text
on I/O to a common format instead of dealing with encoded text on all
levels.
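
A small sketch of that approach in Python, assuming the source encoding
is already known (from the locale or from the user): decode once on
input, work on text, encode once on output. The function names are
hypothetical.

```python
import html


def contains_ignoring_case(raw: bytes, needle: str, source_encoding: str) -> bool:
    """Case-insensitive search done on decoded text, not on encoded bytes."""
    text = raw.decode(source_encoding)
    return needle.casefold() in text.casefold()


def to_html_fragment(raw: bytes, source_encoding: str) -> bytes:
    """Decode the input, escape it for HTML, and emit UTF-8 bytes."""
    text = raw.decode(source_encoding)
    return html.escape(text).encode("utf-8")
```

Doing the same search on the raw bytes would only fold ASCII letters,
so it would miss non-ASCII letters that differ only in case.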

> May I ask an empirical question? In your experience, what percentage
> of Macintosh users change the default encoding from US-ASCII to
> something specific to their culture?

I have no experience with Macintoshes at all.

> What percentage of Ubuntu users change it from UTF-8 to something
> specific?

Why would it matter? If most of their programs use UTF-8, and it's
specified by the locale, then fine. My system uses mostly ISO-8859-2,
and it's also fine, as long as there is a way for the program to get
that information.

If a program can't read my text files or filenames or environment
variables or program invocation arguments, even though they are
encoded according to the locale, then the program is broken.

If a file is not encoded using the encoding specified by the locale,
and I don't tell the program explicitly about the encoding, then it's
not the program's fault when it can't read that.

If a language requires extra steps in order to make the locale
encoding work, then it's unhelpful. Most programmers won't bother,
and their programs will work most of the time when they test them,
assuming they test with English texts. Such programs suddenly break
when used in a non-English speaking country.

> If the answers are "few", then we are talking about a feature that
> will break Windows programs and offer little value to Unix and
> Macintosh users.

How does it break more programs than assuming ASCII does? All
encodings suitable as a system encoding are ASCII supersets, so if
a file can't be read using the locale encoding, it can't be read
in ASCII either.
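
The point can be checked with a few lines of Python. This is only an
illustration using two ASCII-superset encodings I picked as examples
(ISO-8859-2 and UTF-8); can_decode is a made-up helper.

```python
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Pure ASCII data decodes to the same text under either superset encoding.
ascii_only = b"PATH=/usr/bin"
same_everywhere = (
    ascii_only.decode("ascii")
    == ascii_only.decode("iso-8859-2")
    == ascii_only.decode("utf-8")
)

# Non-ASCII data written under an ISO-8859-2 locale is unreadable as ASCII,
# so assuming ASCII never reads a file that the locale encoding cannot.
latin2_bytes = "zażółć".encode("iso-8859-2")
readable_as_ascii = can_decode(latin2_bytes, "ascii")
readable_as_latin2 = can_decode(latin2_bytes, "iso-8859-2")
```

Here same_everywhere and readable_as_latin2 are true, while
readable_as_ascii is false.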

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/