PEP 263 status check

John Roth newsgroups at jhrothjr.com
Fri Aug 6 17:32:48 EDT 2004


"Hallvard B Furuseth" <h.b.furuseth at usit.uio.no> wrote in message
news:HBF.20040806qchc at bombur.uio.no...
> An addition to Martin's reply:
>
> John Roth wrote:
> >"Martin v. Löwis" <martin at v.loewis.de> wrote in message
> >news:41137799.70808 at v.loewis.de...
> >>John Roth wrote:
> >>
> >> To be more specific: In an UTF-8 source file, doing
> >>
> >> print "ö" == "\xc3\xb6"
> >> print "ö"[0] == "\xc3"
> >>
> >> would print two times "True", and len("ö") is 2.
> >> OTOH, len(u"ö")==1.
> >>
> >>> The point of this is that I don't think that either behavior
> >>> is what one would expect. It's also an open invitation
> >>> for someone to make an unchecked mistake! I think this
> >>> may be Hallvard's underlying issue in the other thread.
> >>
> >> What would you expect instead? Do you think your expectation
> >> is implementable?
> >
> > I'd expect that the compiler would reject anything that
> > wasn't either in the 7-bit ascii subset, or else defined
> > with a hex escape.
>
> Then you should also expect a lot of people to move to
> another language - one whose designers live in the real
> world instead of your Utopian Unicode world.

Rudeness objection to your characterization.

Please see my response to Martin - I'm talking only,
and I repeat ONLY, about scripts that explicitly
say they are encoded in utf-8. Nothing else. I've
been in this business for close to 40 years, and I'm
quite well aware of backwards compatibility issues
and issues with breaking existing code.

Programmers in general have a very strong, and
let me repeat that, VERY STRONG assumption
that an 8-bit string contains one byte per character
unless there is a good reason to believe otherwise.
This assumption is built into various places, including
all of the string methods.
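
A minimal sketch of where that assumption goes wrong. It is written
for Python 3, where bytes stands in for the 8-bit string discussed
here (that substitution is an assumption; the thread is about the
Python 2 str type):

```python
# Python 3 sketch; bytes stands in for the Python 2 8-bit str (assumption).
text = "\u00f6"             # one character: o with diaeresis
raw = text.encode("utf-8")  # the bytes a UTF-8 source file stores for it

print(len(text))            # character count
print(len(raw))             # byte count
print(raw[:1])              # the first byte alone is not valid UTF-8

# Byte-oriented methods count bytes, not characters, so padding the
# value to width 5 adds only 3 fill bytes around a single character.
padded = raw.center(5, b"*")
print(len(padded), len(padded.decode("utf-8")))
```

The padded value is 5 bytes long but decodes to only 4 characters,
which is the kind of silent mismatch at issue.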

The current design allows accidental inclusion of
a character that is not in the 7-bit ascii subset ***IN
A PROGRAM THAT HAS A UTF-8 CHARACTER
ENCODING DECLARATION*** to break that
assumption without any kind of notice. That in
turn will break all of the assumptions that the string
module and string methods are based on. That in
turn is likely to break lots of existing modules and
cause a lot of debugging time that could be avoided
by proper design.
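
The kind of notice being asked for could be approximated outside the
compiler. The sketch below uses the standard tokenize module (Python 3)
to list string literals that contain non-ASCII characters. The function
name find_non_ascii_literals is hypothetical, and this is a lint-style
illustration, not how the compiler works or would work:

```python
import io
import tokenize


def find_non_ascii_literals(source):
    """Return (line_number, literal) pairs for string literals
    that contain any character outside 7-bit ASCII.

    Hypothetical helper for illustration only.
    """
    hits = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        # Only string literal tokens are checked; comments are ignored.
        if tok.type == tokenize.STRING and not tok.string.isascii():
            hits.append((tok.start[0], tok.string))
    return hits


# Line 2 holds a raw non-ASCII character; line 3 spells the same
# bytes with hex escapes and is therefore pure ASCII in the source.
sample = 'a = "plain"\nb = "\u00f6"\nc = "\\xc3\\xb6"\n'
for line_number, literal in find_non_ascii_literals(sample):
    print(line_number, literal)
```

Only the literal on line 2 is reported; the hex-escaped form on line 3
passes.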

One of Python's strong points is that it's difficult
to get into trouble unless you deliberately try (then
it's quite easy, fortunately.)

I'm not worried about this causing people to
abandon Python. I'm more worried about the
current situation causing enough grief that people
will decide that utf-8 source code encoding isn't
worth it.

> And tell me why I shouldn't be allowed to work easily with raw
> UTF-8 strings, if I do use coding:utf-8.

First, there's nothing that's stopping you. All that
my proposal will do is require you to do a one
time conversion of any strings you put in the
program as literals. It doesn't affect any other
strings in any other way at any other time.
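
A rough sketch of what that one-time conversion might look like: take
the text of a literal and spell its UTF-8 bytes as hex escapes. The
helper name to_hex_escapes is hypothetical, and the code is Python 3,
so it is an illustration of the idea and not a ready-made tool:

```python
def to_hex_escapes(text):
    """Return the body of a string literal whose UTF-8 bytes are
    written with \\xNN escapes wherever they are not plain ASCII.

    Hypothetical helper for illustration only.
    """
    pieces = []
    for byte in text.encode("utf-8"):
        printable = 0x20 <= byte < 0x7F
        # Quote (0x22) and backslash (0x5c) are escaped too, so the
        # result can be pasted between double quotes safely.
        if printable and byte not in (0x22, 0x5C):
            pieces.append(chr(byte))
        else:
            pieces.append("\\x%02x" % byte)
    return "".join(pieces)


print(to_hex_escapes("\u00f6"))
print(to_hex_escapes("abc"))
```

ASCII text comes back unchanged; only the non-ASCII characters are
rewritten.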

I'll withdraw my objection if you can seriously
assure me that working with raw utf-8 in
8-bit character string literals is what most programmers
are going to do most of the time.

I'm not going to accept the very common need
of converting unicode strings to 8-bit strings so
they can be written to disk or stored in a database
or whatnot (or reversing the conversion for reading.)
That has nothing to do with the current issue - it's
something that everyone who deals with unicode
needs to do, regardless of the encoding of the
source program.
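
For reference, that conversion at the I/O boundary looks roughly like
this in Python 3. An in-memory io.BytesIO buffer stands in for the
disk file or database (an assumption made to keep the sketch
self-contained):

```python
import io

text = "sm\u00f6rg\u00e5sbord"      # unicode text held by the program

# Writing: encode to an 8-bit representation at the boundary.
storage = io.BytesIO()               # stand-in for a file or database
storage.write(text.encode("utf-8"))

# Reading: decode back to unicode on the way in.
restored = storage.getvalue().decode("utf-8")

print(len(text), len(storage.getvalue()), restored == text)
```

The stored form is longer in bytes than the text is in characters, but
the round trip gives back the original string.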

John Roth
>
> -- 
> Hallvard




