PEP 263 status check

Fri Aug 6 15:15:43 EDT 2004

John Roth wrote:
>>What would you expect instead? Do you think your expectation
>>is implementable?
> 
> 
> I'd expect that the compiler would reject anything that
> wasn't either in the 7-bit ascii subset, or else defined
> with a hex escape.

Are we still talking about PEP 263 here? If the entire source
code has to be in the 7-bit ASCII subset, then what is the point
of encoding declarations?

If you were suggesting that anything except Unicode literals
should be in the 7-bit ASCII subset, then this is still
unacceptable: Comments should also be allowed to contain non-ASCII
characters, don't you agree?

If you think that only Unicode literals and comments should be
allowed to contain non-ASCII, I disagree: At some point, I'd
like to propose support for non-ASCII in identifiers. This would
allow people to make identifiers that represent words from their
native language, which is helpful for people who don't speak
English well.

If you think that only Unicode literals, comments, and identifiers
should be allowed to contain non-ASCII: perhaps, but this is out of scope
of PEP 263, which *only* introduces encoding declarations,
and explains what they mean for all current constructs.

> The reason for this is simply that wanting to put characters
> outside of the 7-bit ascii subset into a byte character string
> isn't portable. 

Define "is portable". With an encoding declaration, I can move
the source code from one machine to another, open it in an editor,
and have it display correctly. This was not portable without
encoding declarations (likewise for comments); with PEP 263,
such source code became portable.
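
A minimal sketch of what the declaration buys you, assuming a
present-day Python 3 interpreter (where compile() accepts the source
as bytes and honours the coding comment). The helper name
literal_from_source is made up for this illustration. The same byte
0xE9 yields a different character depending on the declared encoding,
so the file carries its own interpretation with it:

```python
def literal_from_source(declared_encoding):
    """Compile a tiny module whose string literal contains the raw byte 0xE9.

    The module is handed to compile() as bytes, so the interpreter itself
    has to consult the PEP 263 coding comment in order to decode it.
    """
    source = (
        b"# -*- coding: " + declared_encoding.encode("ascii") + b" -*-\n"
        b"s = '\xe9'\n"
    )
    namespace = {}
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["s"]


# Same byte in the file, two different declarations.
latin1_char = literal_from_source("iso-8859-1")
koi8r_char = literal_from_source("koi8-r")
```

Any editor that reads the coding comment can make the same decision
the compiler makes here.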

Also, the run-time behaviour is fully predictable (as it was
even without PEP 263): At run-time, the byte string will contain
exactly the same bytes that the literal has in the .py file. This
is fully portable.
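
To make that concrete, here is a rough model of the round trip that
PEP 263 describes for byte string literals: the file is decoded with
the declared encoding, and the literal's contents are later re-encoded
with that same encoding. It is written for a present-day Python 3 and
only simulates the 2.x behaviour; runtime_bytes is a hypothetical
helper, not interpreter code, and the sample bytes are an assumption:

```python
def runtime_bytes(literal_bytes_in_file, declared_encoding):
    """Model what a 2.x byte string literal holds at run time.

    Decoding with the declared encoding and re-encoding with the same
    encoding should give back the bytes that were in the file.
    """
    decoded = literal_bytes_in_file.decode(declared_encoding)
    return decoded.encode(declared_encoding)


# Assumed sample: four bytes of KOI8-R text as they might sit in a .py file.
in_file = b"\xf0\xd2\xc9\xd7"
at_runtime = runtime_bytes(in_file, "koi8-r")
```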

> It just pushes the need for a character set
> (encoding) declaration down one level of recursion.

It depends on the program. E.g. if the program generates HTML
files with an explicit http-equiv declaration of
charset=iso-8859-1, then the resulting program is absolutely,
100% portable.
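
As a sketch of that case (present-day Python 3 syntax; the function
name and the page content are invented for illustration), the output
names its own charset, so a reader needs no knowledge of the machine
that produced it. In 2.x, the bytes of the string literal would simply
be written out unchanged; here the text is encoded explicitly:

```python
def render_page(body_text):
    """Return an HTML page as ISO-8859-1 bytes that declares its own charset."""
    html = (
        "<html><head>"
        '<meta http-equiv="Content-Type" '
        'content="text/html; charset=iso-8859-1">'
        "</head><body>" + body_text + "</body></html>"
    )
    # Encode with the same charset that the page declares.
    return html.encode("iso-8859-1")


# Sample body with two non-ASCII characters, written with escapes.
page = render_page("Gr\u00fc\u00dfe")
```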

For messages directly output to a terminal, portability
might not be important.

> There's already a way of doing this: use a unicode string,
> so it's not like we need two ways of doing it.

Using a Unicode string might not work, because a library might
crash when confronted with a Unicode string. You are proposing
to break existing applications for no good reason, and with
no simple fix.

> Now I will grant you that there is a need for representing
> the utf-8 encoding in a character string, but do we need
> to support that in the source text when it's much more
> likely that it's a programming mistake?

But it isn't! People do put KOI-8R into source code, into
string literals, and it works perfectly fine for them. There
is no reason to arbitrarily break their code.

> As far as implementation goes, it should have been done
> at the beginning. Prior to 2.3, there was no way of writing
> a program using the utf-8 encoding (I think - I might be
> wrong on that)

You are wrong. You were always able to put UTF-8 into byte
strings, even at a time when UTF-8 was not yet an RFC
(say, in Python 1.1).

> so there were no programs out there that
> put non-ascii subset characters into byte strings.

That is just not true. If it were true, there would be no
need to introduce a grace period in the PEP. However,
*many* scripts in the world use non-ASCII in string literals;
it was always possible (although the documentation was
wishy-washy on what it actually meant).

> Today it's one more forward migration hurdle to jump over.
> I don't think it's a particularly large one, but I don't have
> any real world data at hand.

Trust me: the outcry for banning non-ASCII from string literals
would be, by far, louder than the one for a proposed syntax
on decorators. Such a ban would break many production systems:
CGI scripts would suddenly stop working, GUIs would crash, etc.

Regards,
Martin