PEP 263 status check

Fri Aug 6 07:42:23 EDT 2004

"Martin v. Löwis" <martin at v.loewis.de> wrote in message
news:41133C76.8040302 at v.loewis.de...
> John Roth wrote:

> > My specific question there was how the code handles the
> > combination of UTF-8 as the encoding and a non-ascii
> > character in an 8-bit string literal. Is this an error? The
> > PEP does not say so. If it isn't, what encoding will
> > it use to translate from Unicode back to an 8-bit
> > encoding?
>
> UTF-8 is not in any way special wrt. the PEP.

That's what I thought.

> Notice that
> UTF-8 is *not* Unicode - it is an encoding of Unicode, just
> like ISO-8859-1 or us-ascii (although the latter two only
> encode a subset of Unicode).
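
To make the "subset" remark concrete, here is a minimal sketch. It assumes Python 3 syntax, where text is `str` and 8-bit data is `bytes`, unlike the Python 2 interpreter under discussion. The helper name `can_encode` is made up for illustration.

```python
# Assumption: Python 3, where str is Unicode text and bytes is 8-bit data.
text = "caf\u00e9 \u0416"  # U+00E9 is in Latin-1; Cyrillic U+0416 is not

# UTF-8 can represent every Unicode code point, so this always succeeds.
utf8_bytes = text.encode("utf-8")

def can_encode(s, encoding):
    """Hypothetical helper: True if every character of s fits the encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

latin1_ok = can_encode(text, "iso-8859-1")      # False: U+0416 is outside Latin-1
ascii_ok = can_encode("caf\u00e9", "us-ascii")  # False: U+00E9 is outside ASCII
```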

I disagree, but I think this is a definitional issue.

> Yes, the byte string literals
> will be converted back to an "8-bit encoding", but the 8-bit
> encoding will be UTF-8! IOW, byte string literals are always
> converted back to the source encoding before execution.
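
A rough model of that round trip, as a sketch only: it assumes Python 3 syntax, and `byte_literal_value` is a hypothetical helper that imitates the decode and re-encode steps, not anything from the actual compiler.

```python
def byte_literal_value(source_bytes, source_encoding):
    """Hypothetical model of what happens to a byte string literal.

    The source bytes are decoded to Unicode using the declared encoding,
    then the literal's contents are re-encoded to that same encoding.
    """
    as_unicode = source_bytes.decode(source_encoding)
    return as_unicode.encode(source_encoding)

# The same character (U+00E9) as it would sit on disk in two source files.
utf8_source = "\u00e9".encode("utf-8")         # b'\xc3\xa9'
latin1_source = "\u00e9".encode("iso-8859-1")  # b'\xe9'

utf8_value = byte_literal_value(utf8_source, "utf-8")
latin1_value = byte_literal_value(latin1_source, "iso-8859-1")
```

Under this model the run-time contents of the literal are the bytes that were in the file, so the same visible character yields two bytes in a UTF-8 source file and one byte in an ISO-8859-1 source file.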

If I understand you correctly, if I put, say, a mixture of
Cyrillic, Hebrew, Arabic and Greek into a byte string
literal, at run time that character string will contain the
proper Unicode code point at each character position?

Or are you trying to say that the character string will
contain the UTF-8 encoding of these characters; that
is, if I do a subscript, I will get one character of the
multi-byte encoding?
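
The second reading can be sketched like this. It assumes Python 3 syntax, with `bytes` standing in for the 8-bit string type, and the four sample characters are my own choice.

```python
# One character each of Cyrillic, Hebrew, Arabic and Greek, written with
# escapes so the example does not depend on this file's own encoding.
text = "\u0416\u05d0\u0639\u03a9"

# Stand-in for the run-time value of a byte string literal in a UTF-8 source.
data = text.encode("utf-8")

char_count = len(text)  # 4 characters
byte_count = len(data)  # 8 bytes: each of these takes two bytes in UTF-8

# Subscripting the byte string yields a piece of a multi-byte sequence,
# not a whole character.
first_byte = data[0:1]  # b'\xd0', the lead byte of U+0416

# A lone lead byte is not valid UTF-8 on its own.
try:
    first_byte.decode("utf-8")
    first_is_complete = True
except UnicodeDecodeError:
    first_is_complete = False
```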

The point of this is that I don't think that either behavior
is what one would expect. It's also an open invitation
for someone to make an unchecked mistake! I think this
may be Hallvard's underlying issue in the other thread.

> Regards,
> Martin

John Roth