PEP 263 status check

John Roth newsgroups at jhrothjr.com
Fri Aug 6 17:06:53 EDT 2004


"Martin v. Löwis" <martin at v.loewis.de> wrote in message
news:4113D8DF.8080106 at v.loewis.de...
> John Roth wrote:
> >>What would you expect instead? Do you think your expectation
> >>is implementable?
> >
> >
> > I'd expect that the compiler would reject anything that
> > wasn't either in the 7-bit ascii subset, or else defined
> > with a hex escape.
>
> Are we still talking about PEP 263 here? If the entire source
> code has to be in the 7-bit ASCII subset, then what is the point
> of encoding declarations?

Martin, I think you misinterpreted what I said at the
beginning. I'm only, and I need to repeat this, ONLY
dealing with the case where the encoding declaration
specifically says that the script is in UTF-8. No other
case.
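For context, the declaration PEP 263 introduced is a magic comment on
the first or second line of the file; a minimal sketch:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# With this declaration the interpreter decodes the source file as
# UTF-8; without one, Python 2 assumed ASCII (after the PEP's grace
# period). The whole discussion below is about files carrying exactly
# this declaration.
```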

I'm going to deal with your response point by point,
but I don't think most of this is really relevant. Your
response only makes sense if you missed the point that
I was talking about scripts that explicitly declared their
encoding to be UTF-8, and no other scripts under any
other circumstances.

I didn't mean the entire source had to be in 7-bit ASCII. What
I meant was that if the encoding is UTF-8, then the source
for 8-bit string literals must be in 7-bit ASCII. Nothing more.
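The concern can be sketched in modern Python, where the encoding step
is explicit; in Python 2, a plain string literal in a UTF-8 file picked
up the raw bytes implicitly:

```python
# One character in the source text...
text = "é"
# ...but two bytes once encoded as UTF-8 -- which is what a Python 2
# byte-string literal in a UTF-8 file silently contained.
data = text.encode("utf-8")
print(len(text))  # 1 character
print(len(data))  # 2 bytes
```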

> If you were suggesting that anything except Unicode literals
> should be in the 7-bit ASCII subset, then this is still
> unacceptable: Comments should also be allowed to contain non-ASCII
> characters, don't you agree?

Of course.

> If you think that only Unicode literals and comments should be
> allowed to contain non-ASCII, I disagree: At some point, I'd
> like to propose support for non-ASCII in identifiers. This would
> allow people to make identifiers that represent words from their
> native language, which is helpful for people who don't speak
> English well.

Likewise. I never thought otherwise; in fact, I'd like to expand
the available operators to include the set operators, the
logical operators, and the "real" division operator (the one
you learned in grade school - the dash with a dot above and
below the line.)

> If you think that only Unicode literals, comments, and identifiers
> should be allowed non-ASCII: perhaps, but this is out of scope
> of PEP 263, which *only* introduces encoding declarations,
> and explains what they mean for all current constructs.
>
> > The reason for this is simply that wanting to put characters
> > outside of the 7-bit ascii subset into a byte character string
> > isn't portable.
>
> Define "is portable". With an encoding declaration, I can move
> the source code from one machine to another, open it in an editor,
> and have it display correctly. This was not portable without
> encoding declarations (likewise for comments); with PEP 263,
> such source code became portable.

> Also, the run-time behaviour is fully predictable (which it
> even was without PEP 263): At run-time, the string will have
> exactly the same bytes that it does in the .py file. This
> is fully portable.

It's predictable, but as far as I'm concerned, that's
not only useless behavior, it's counterproductive
behavior. I find it difficult to imagine any case
where the benefit of having ordinary character
literals contain UTF-8 multi-byte characters
outweighs the pain of having it happen
accidentally, and then figuring out why your program
is giving you weird behavior.
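A sketch of the kind of surprise being described, shown with explicit
bytes in modern Python (in Python 2 the same thing happened invisibly
inside an ordinary string literal):

```python
data = "naïve".encode("utf-8")   # 5 characters, 6 bytes

print(len(data))   # 6, not 5 -- byte length diverges from character count
print(data[:3])    # b'na\xc3' -- the slice cuts the 'ï' in half
# Decoding the truncated slice yields a replacement character:
print(data[:3].decode("utf-8", errors="replace"))  # 'na\ufffd'
```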

I would grant that there are cases where you
might want this behavior. I am pretty sure they
are in the distinct minority.


> > It just pushes the need for a character set
> > (encoding) declaration down one level of recursion.
>
> It depends on the program. E.g. if the program was to generate
> HTML files with an explicit HTTP-Equiv charset=iso-8859-1,
> then the resulting program is absolutely, 100% portable.

It's portable, but that's not the normal case. See above.
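Martin's HTML case can be sketched like this (a hypothetical snippet;
in Python 2 the literal itself would hold the Latin-1 bytes directly):

```python
# Bytes encoded as ISO-8859-1 match the charset the page declares,
# so the output is self-consistent wherever the program runs.
header = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
body = "café".encode("iso-8859-1")  # one byte per character in Latin-1
assert len(body) == 4
page = header + b"\n" + body
```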

> For messages directly output to a terminal, portability
> might not be important.

Portability is less of an issue for me than the likelihood
of making a mistake in coding a literal and then having
to debug unexpected behavior when one byte no longer
equals one character.


> > There's already a way of doing this: use a unicode string,
> > so it's not like we need two ways of doing it.
>
> Using a Unicode string might not work, because a library might
> crash when confronted with a Unicode string. You are proposing
> to break existing applications for no good reason, and with
> no simple fix.

There's no reason why you have to use a UTF-8
encoding declaration. If you want your source to
be UTF-8, you need to accept the consequences.
I fully expect Python to support the usual mixture
of encodings until 3.0 at least. At that point, everything
gets to be rewritten anyway.
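The alternative alluded to above - use a Unicode literal - keeps one
code point per character regardless of the file's encoding. Shown here
in modern Python, where `str` is always Unicode:

```python
s = "héllo"          # a Unicode (text) string: five characters
assert len(s) == 5   # character count, independent of source encoding
# Encoding is deferred to an explicit boundary instead of happening
# silently inside the literal:
assert len(s.encode("utf-8")) == 6   # 'é' takes two bytes in UTF-8
```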

> > Now I will grant you that there is a need for representing
> > the utf-8 encoding in a character string, but do we need
> > to support that in the source text when it's much more
> > likely that it's a programming mistake?
>
> But it isn't! People do put KOI-8R into source code, into
> string literals, and it works perfectly fine for them. There
> is no reason to arbitrarily break their code.
>
> > As far as implementation goes, it should have been done
> > at the beginning. Prior to 2.3, there was no way of writing
> > a program using the utf-8 encoding (I think - I might be
> > wrong on that)
>
> You are wrong. You were always able to put UTF-8 into byte
> strings, even at a time where UTF-8 was not yet an RFC
> (say, in Python 1.1).

Were you able to write your entire program in UTF-8?
I think not.

>
> > so there were no programs out there that
> > put non-ascii subset characters into byte strings.
>
> That is just not true. If it were true, there would be no
> need to introduce a grace period in the PEP. However,
> *many* scripts in the world use non-ASCII in string literals;
> it was always possible (although the documentation was
> wishy-washy on what it actually meant).
>
> > Today it's one more forward migration hurdle to jump over.
> > I don't think it's a particularly large one, but I don't have
> > any real world data at hand.
>
> Trust me: the outcry for banning non-ASCII from string literals
> would be, by far, louder than the one for a proposed syntax
> on decorators. That would break many production systems, CGI
> scripts would suddenly stop working, GUIs would crash, etc.

>
> Regards,
> Martin





More information about the Python-list mailing list