PEP 263 status check

John Roth newsgroups at jhrothjr.com
Fri Aug 6 09:33:25 EDT 2004


"Martin v. Löwis" <martin at v.loewis.de> wrote in message
news:41137799.70808 at v.loewis.de...
> John Roth wrote:
> > Or are you trying to say that the character string will
> > contain the UTF-8 encoding of these characters; that
> > is, if I do a subscript, I will get one character of the
> > multi-byte encoding?
>
> Michael is almost right: this is what happens. Except that
> what you get, I wouldn't call a "character". Instead, it
> is always a single byte - even if that byte is part of
> a multi-byte character.
>
> Unfortunately, the things that constitute a byte string
> are also called characters in the literature.
>
> To be more specific: In a UTF-8 source file, doing
>
> print "ö" == "\xc3\xb6"
> print "ö"[0] == "\xc3"
>
> would print two times "True", and len("ö") is 2.
> OTOH, len(u"ö")==1.
>
> > The point of this is that I don't think that either behavior
> > is what one would expect. It's also an open invitation
> > for someone to make an unchecked mistake! I think this
> > may be Hallvard's underlying issue in the other thread.
>
> What would you expect instead? Do you think your expectation
> is implementable?

I'd expect the compiler to reject any byte string literal
that contains something which is neither in the 7-bit ASCII
subset nor written as a hex escape.
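
As a rough illustration of the rule I have in mind, here is
a minimal sketch. The helper name is made up for this example,
and it only looks at the source text between the quotes; it is
not what any Python compiler actually does.

```python
def is_portable_byte_literal(body):
    """Return True if the literal body uses only 7-bit ASCII.

    Hypothetical helper: `body` is the source text between the
    quotes, so a hex escape shows up as plain ASCII characters.
    """
    return all(ord(ch) < 128 for ch in body)


# A hex escape is spelled with ASCII characters only, so it passes.
print(is_portable_byte_literal(r"\xc3\xb6"))
# A raw non-ASCII character is what the compiler would reject.
print(is_portable_byte_literal("ö"))
```

Under this rule the escaped form stays legal everywhere, while
the raw character has to move into a Unicode string.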

The reason is simply that putting characters outside the
7-bit ASCII subset into a byte string isn't portable: the
bytes you end up with depend on the encoding of the source
file. That just pushes the need for a character set
(encoding) declaration down one level of recursion.
There's already a way of doing this, namely a Unicode string,
so it's not as if we need two ways of doing it.
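
To make the portability point concrete, here is a small
sketch of how the same character turns into different bytes
depending on the encoding. It is written in Python 3 syntax,
where bytes plays the role of the byte string discussed here;
the variable names are only for illustration.

```python
text = "ö"

utf8_bytes = text.encode("utf-8")      # b'\xc3\xb6'
latin1_bytes = text.encode("latin-1")  # b'\xf6'

# One character, but the byte count depends on the encoding.
print(len(text))
print(len(utf8_bytes))
print(len(latin1_bytes))

# The two byte strings are not equal, so code that compares or
# indexes them behaves differently per source encoding.
print(utf8_bytes == latin1_bytes)
```

A byte string holding the raw character therefore means
something different in a Latin-1 file than in a UTF-8 file,
while the Unicode string means the same thing in both.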

Now I will grant you that there is a need for representing
the UTF-8 encoding in a character string, but do we need
to support that in the source text when it's much more
likely to be a programming mistake?

As far as implementation goes, it should have been done
at the beginning. Prior to 2.3, there was no way of writing
a program using the UTF-8 encoding (I think; I might be
wrong on that), so there were no programs out there that
put characters outside the ASCII subset into byte strings.

Today it's one more forward migration hurdle to jump over.
I don't think it's a particularly large one, but I don't have
any real-world data at hand.

John Roth
>
> Regards,
> Martin




