PEP 263 status check

Fri Aug 6 22:25:56 EDT 2004

"Martin v. Löwis" <martin at v.loewis.de> wrote in message
news:41140A61.3040600 at v.loewis.de...
> John Roth wrote:
> > I've
> > been in this business for close to 40 years, and I'm
> > quite well aware of backwards compatibility issues
> > and issues with breaking existing code.
> >
> > Programmers in general have a very strong, and
> > let me repeat that, VERY STRONG assumption
> > that an 8-bit string contains one byte per character
> > unless there is a good reason to believe otherwise.
>
> You clearly come from a Western business. In CJK
> languages, people are very aware that characters can
> have more than one byte. They consider UTF-8 as just
> another multi-byte encoding, and used to consider it
> as an encoding that Westerners made to complicate their
> lives. That attitude appears to be changing now, but
> UTF-8 is not a clear winner in the worlds where we
> Westerners would expect it to be a clear winner.

I'm aware of that.

> > The current design allows accidental inclusion of
> > a character that is not in the 7-bit ASCII subset ***IN
> > A PROGRAM THAT HAS A UTF-8 CHARACTER
> > ENCODING DECLARATION*** to break that
> > assumption without any kind of notice.
>
> This is a problem only for the Western world. In the
> CJK languages, such programs were broken a long time
> ago. I don't think Python needs to be so Americo-centric
> as to protect American programmers from programming
> mistakes.

American != non East Asian.

In fact, I would consider American programmers to
be the least prone to making this kind of mistake
simply because all standard characters are included
in the US-ASCII subset. It's much more likely to be
a European (or non North American) problem.
Even when writing in English, people's names will
have non-English characters, and they have a
tendency to leak into literals.
(Mexico considers itself to be part of
Central America, for some political reason.)

> > That in
> > turn will break all of the assumptions that the string
> > module and string methods are based on. That in
> > turn is likely to break lots of existing modules and
> > cause a lot of debugging time that could be avoided
> > by proper design.
>
> Indeed. If the program is currently not broken, why
> are you changing the source encoding? If you are
> trying to support multiple languages, a properly-
> designed application would use gettext instead
> of putting non-ASCII into source code.
>
> If you are writing a new application, and you
> put non-ASCII into the source, in UTF-8, are you
> not testing your application properly?
>
> > I'm not worried about this causing people to
> > abandon Python. I'm more worried about the
> > current situation causing enough grief that people
> > will decide that utf-8 source code encoding isn't
> > worth it.
>
> Again, this is what Hallvard's PEP is for. It
> does not apply to UTF-8 only, but I see no reason
> why UTF-8 needs to be singled out.
>
> > I'll withdraw my objection if you can seriously
> > assure me that working with raw utf-8 in
> > 8-bit character string literals is what most programmers
> > are going to do most of the time.
>
> In what time scale? In the near term, most people will use
> other source encodings. In the medium term, I expect
> Unix will switch to UTF-8 throughout, at which point
> using UTF-8 byte strings will work on every Unix
> system - the scripts, by nature, won't work on non-Unix
> systems, anyway. In the long term, I expect all Python
> strings will be Unicode strings, unless explicitly
> declared as byte strings.

I asked Hallvard this question, not you. It makes sense
in the context of the statements of his I was responding to.

Your answer does not make sense. Hallvard's objection
was that he actually wanted to have non-ASCII characters
put into byte literals in their UTF-8 encoded forms (at least
as I understand it).

If I thought about it, I could undoubtedly come up with
use cases where I would find this behavior useful. The
presupposition behind my statement was that those
use cases were overwhelmingly less likely than the
standard uses of byte string literals where a utf-8
encoded "character" would be a problem.
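
To make the "one byte per character" assumption concrete, here is a
minimal sketch. It is written for Python 3 (which postdates this
thread) so that it runs today; in the Python of 2004 the same bytes
would sit directly inside an ordinary 8-bit string literal. The sample
name and the variable names are illustrative only.

```python
# A name with one non-ASCII character, written with an escape so the
# example does not depend on the encoding of this file.
name = "L\u00f6wis"

# The bytes a UTF-8 encoded source file would hold for that literal.
raw = name.encode("utf-8")

char_count = len(name)  # number of characters
byte_count = len(raw)   # number of bytes; larger, since the o-umlaut takes two

# Slicing by byte position, as code written for one-byte-per-character
# strings does, can cut a multi-byte sequence in half.
first_two_bytes = raw[:2]
try:
    first_two_bytes.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(char_count, byte_count, split_ok)
```

Length checks, slicing, and column alignment on the byte form all see
six units where the reader sees five characters, and nothing warns
about the difference.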

John Roth

>
> Regards,
> Martin