PEP 263 status check

Fri Aug 6 18:32:47 EDT 2004

John Roth wrote:
> Martin, I think you misinterpreted what I said at the
> beginning. I'm only, and I need to repeat this, ONLY
> dealing with the case where the encoding declaration
> specifically says that the script is in UTF-8. No other
> case.

From the viewpoint of PEP 263, there is absolutely *no*,
and I repeat NO difference between choosing UTF-8 and
choosing windows-1252 as the source encoding.

> I'm going to deal with your response point by point,
> but I don't think most of this is really relevant. Your
> response only makes sense if you missed the point that
> I was talking about scripts that explicitly declared their
> encoding to be UTF-8, and no other scripts in no
> other circumstances.

I don't understand why it is desirable to single out
UTF-8 as a source encoding. PEP 263 does no such thing,
except for allowing an additional encoding declaration
for UTF-8 (by means of the UTF-8 signature).

> I didn't mean the entire source was in 7-bit ascii. What
> I meant was that if the encoding was utf-8 then the source
> for 8-bit string literals must be in 7-bit ascii. Nothing more.

PEP 263 never says such a thing. Why did you get this impression
after reading it?

*If* you understood that byte string literals can have the full
power of the source encoding, plus hex-escaping, I can't see what
made you think that power did not apply if the source encoding
was UTF-8.
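
To illustrate the point, here is a minimal sketch. It assumes a modern
Python 3 interpreter, which no longer stores source bytes in string
literals the way Python 2.x does, so it only simulates the effect: it
shows which bytes a byte string literal containing the single character
U+00E9 would hold under each source encoding. The variable names are
mine, not anything from the PEP.

```python
# Simulation only: in Python 2.x a byte string literal holds the raw bytes
# of the source file, so the same character yields different bytes
# depending on the declared source encoding.
text = u"\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

utf8_bytes = text.encode("utf-8")            # file saved and declared as UTF-8
cp1252_bytes = text.encode("windows-1252")   # file saved and declared as windows-1252

print(repr(utf8_bytes))
print(repr(cp1252_bytes))

# Hex escapes produce the same bytes whatever the source encoding is.
print(utf8_bytes == b"\xc3\xa9")
print(cp1252_bytes == b"\xe9")
```

In both cases the literal simply carries the encoded bytes; UTF-8 is
not treated specially, it just happens to use two bytes here.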

> Likewise. I never thought otherwise; in fact I'd like to expand
> the available operators to include the set operators as well as
> the logical operators and the "real" division operator (the one
> you learned in grade school - the dash with a dot above and
> below the line.)

That would be a different PEP, though, and I doubt Guido will be
in favour. However, this is OT for this thread.

> It's predictable, but as far as I'm concerned, that's
> not only useless behavior, it's counterproductive
> behavior. I find it difficult to imagine any case
> where the benefit of having normal character
> literals accidentally contain utf-8 multi-byte
> characters outweighs the pain of having it happen
> accidentally, and then figuring out why your program
> is giving you weird behavior.

Might be. This is precisely the issue that Hallvard is addressing.
I agree there should be a mechanism to check whether all significant
non-ASCII characters are inside Unicode literals.

I personally would prefer a command line switch over a per-file
declaration, but that would be the subject of Hallvard's PEP.
Under no circumstances would I disallow using the full source
encoding in byte strings, even if the source encoding is UTF-8.

> There's no reason why you have to have a utf-8
> encoding declaration. If you want your source to
> be utf-8, you need to accept the consequences.

Even for UTF-8, you need an encoding declaration (although
the UTF-8 signature is sufficient for that matter). If
there is no encoding declaration whatsoever, Python will
assume that the source is us-ascii.
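
As a rough sketch of how the two kinds of declaration are recognized,
the snippet below uses tokenize.detect_encoding from the standard
library of a modern Python 3. Note the important caveat: Python 3
defaults to UTF-8 when nothing is declared (PEP 3120), whereas the
Python 2.x behaviour described above is to assume us-ascii. The helper
name declared_encoding and the sample sources are made up for
illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding that tokenize detects for the given source bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# An explicit PEP 263 coding comment on the first line.
with_cookie = b"# -*- coding: utf-8 -*-\nx = 1\n"
# The UTF-8 signature (byte order mark) and no coding comment.
with_signature = b"\xef\xbb\xbfx = 1\n"
# No declaration at all.
undeclared = b"x = 1\n"

print(declared_encoding(with_cookie))     # utf-8
print(declared_encoding(with_signature))  # utf-8-sig
print(declared_encoding(undeclared))      # utf-8 (the Python 3 default, not us-ascii)
```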

> I fully expect Python to support the usual mixture
> of encodings until 3.0 at least. At that point, everything
> gets to be rewritten anyway.

I very much doubt that, in two ways:
a) Python 3.0 will not happen in the foreseeable future
b) if it happens, much code will stay the same, or only
    require minor changes. I doubt that non-UTF-8 source
    encoding will be banned in Python 3.

> Were you able to write your entire program in UTF-8?
> I think not.

What do you mean, your entire program? All strings?
Certainly you were. Why not?

Of course, before UTF-8 was an RFC, there were no
editors available, nor would any operating system
support output in UTF-8, so you would need to
organize everything on your own (perhaps it was
simpler on Plan-9 at that time, but I have never
really used Plan-9 - and you might have needed
UTF-1 instead, anyway).

Regards,
Martin