PEP 263 status check

John Roth newsgroups at jhrothjr.com
Fri Aug 6 22:09:59 EDT 2004


"Martin v. Löwis" <martin at v.loewis.de> wrote in message
news:4114070F.90507 at v.loewis.de...
> John Roth wrote:
> > Martin, I think you misinterpreted what I said at the
> > beginning. I'm only, and I need to repeat this, ONLY
> > dealing with the case where the encoding declaration
> > specifically says that the script is in UTF-8. No other
> > case.
>
>  From the viewpoint of PEP 263, there is absolutely *no*,
> and I repeat NO difference between choosing UTF-8 and
> choosing windows-1252 as the source encoding.

I don't believe I ever said that PEP 263 said there was
a difference. If I gave you that impression, I will
apologize if you can show me where I did it.



> > I'm going to deal with your response point by point,
> > but I don't think most of this is really relevant. Your
> > response only makes sense if you missed the point that
> > I was talking about scripts that explicitly declared their
> > encoding to be UTF-8, and no other scripts in no
> > other circumstances.
>
> I don't understand why it is desirable to single out
> UTF-8 as a source encoding. PEP 263 does no such thing,
> except for allowing an additional encoding declaration
> for UTF-8 (by means of the UTF-8 signature).

As far as I'm concerned, what PEP 263 says is utterly
irrelevant to the point I'm trying to make.

The only connection PEP 263 has to the entire thread
(at least from my view) is that I wanted to check
whether phase 2, as described in the PEP, was
scheduled for 2.4. I was under the impression it was
and was puzzled by not seeing it. You said it wouldn't
be in 2.4. Question answered, no further issue on
that point (but see below for an additional puzzlement).

> > I didn't mean the entire source was in 7-bit ascii. What
> > I meant was that if the encoding was utf-8 then the source
> > for 8-bit string literals must be in 7-bit ascii. Nothing more.
>
> PEP 263 never says such a thing. Why did you get this impression
> after reading it?

I didn't get it from the PEP. I got it from what you said. Your
response seemed to make sense only if you assumed that I
had this totally idiotic idea that we should change everything
to 7-bit ASCII. That was not my intention.

Let's go back to square one and see if I can explain my
concern from first principles.

8-bit strings have a built-in assumption that one
byte equals one character. This is something that
is ingrained in the basic fabric of many programming
languages, Python included. It's a basic assumption
in the string module, in the string methods, and all
through just about everything, and it's something that
most programmers expect, and IMO have every right
to expect.
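
Here's a quick illustration of that assumption (Python 2,
assuming the file really is saved as Latin-1):

    # -*- coding: latin-1 -*-
    # With a single-byte source encoding, one byte is one character:
    s = "café"        # four characters, four bytes under Latin-1
    print len(s)      # 4
    print repr(s[3])  # '\xe9' -- indexing picks out whole characters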

Now, people violate this assumption all the time,
for a number of reasons, including binary data and
encoded data (UTF-8 encodings included), but they
do so deliberately, knowing what they're doing.
These particular exceptions don't negate the rule.

The problem I have is that if you use UTF-8 as the
source encoding, you can suddenly drop multi-byte
characters into an 8-bit string ***BY ACCIDENT***.
This accident is not possible with single-byte
encodings, which is why I am emphasizing that I
am only talking about source that is encoded in UTF-8.
(I don't know what happens with Far Eastern multi-byte
encodings.)

UTF-8 encoded source has this problem. Source
encoded with single-byte encodings does not have
this problem. It's as simple as that. Accordingly,
it is not my intention, and has never been my
intention, to change the way 8-bit string literals
are handled when the source program has a
single-byte encoding.
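
To make the accident concrete, here's a sketch (Python 2,
assuming the file is saved as UTF-8 to match its declaration):

    # -*- coding: utf-8 -*-
    # The literal looks like one character in an editor, but under
    # UTF-8 it is two bytes, so the 8-bit string has length 2.
    s = "é"
    print len(s)   # 2, not 1 -- the invariant silently breaks
    print repr(s)  # '\xc3\xa9'

No error, no warning; the program just quietly misbehaves later.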

We may disagree on whether this is enough of
a problem to warrant a solution. That's life.

Now, my suggested solution to this problem was
to require that 8-bit string literals in source that is
encoded with UTF-8 be restricted to the 7-bit
ASCII subset. The reason is that there are logically
three things that can be done here if we find a
character that is outside the 7-bit ASCII subset.

One is to keep the current practice and violate the
one byte == one character invariant; the second
is to use some encoding to convert the non-ASCII
characters into a single-byte encoding, thus
preserving the one byte == one character invariant;
the third is to prohibit anything that is ambiguous,
which in practice means restricting 8-bit literals
to the 7-bit ASCII subset (plus hex escapes, of course).
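
In practice the third option looks like this (Python 2,
illustrative only):

    # -*- coding: utf-8 -*-
    # Unicode literals may use the full source encoding safely:
    greeting = u"café"        # a unicode object; len(greeting) == 4
    # 8-bit literals stay within ASCII, spelling out any byte above
    # 127 as a hex escape, so the bytes are never ambiguous:
    latin1_bytes = "caf\xe9"  # four bytes, one byte per character
    print len(latin1_bytes)   # 4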

The second possibility raises the question of which
encoding to use, which is why I don't seriously
propose it (although if I understand Hallvard's
position correctly, that's essentially his proposal).

> *If* you understood that byte string literals can have the full
> power of the source encoding, plus hex-escaping, I can't see what
> made you think that power did not apply if the source encoding
> was UTF-8.

I think I covered that adequately above. It's not that
it doesn't apply, it's that it's unsafe.

> > It's predictable, but as far as I'm concerned, that's
> > not only useless behavior, it's counterproductive
> > behavior. I find it difficult to imagine any case
> > where the benefit of having normal character
> > literals accidentally contain utf-8 multi-byte
> > characters outweighs the pain of having it happen
> > accidentally, and then figuring out why your program
> > is giving you weird behavior.
>
> Might be. This is precisely the issue that Hallvard is addressing.
> I agree there should be a mechanism to check whether all significant
> non-ASCII characters are inside Unicode literals.

I think that means we're in substantive agreement (although
I see no reason to restrict comments to 7-bit ASCII).
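
I haven't seen the exact wording of Hallvard's proposal, but
a rough sketch of such a check is easy to imagine. Here is a
hypothetical version built on the standard tokenize module
(the function name and the ASCII-only policy for plain string
literals are my assumptions, not his PEP):

    import tokenize

    def check_byte_literals(filename):
        # Hypothetical checker: flag non-ASCII bytes inside plain
        # (non-u) string literals; comments and u"" literals pass.
        f = open(filename)
        try:
            for tok in tokenize.generate_tokens(f.readline):
                tok_type, tok_str, start = tok[0], tok[1], tok[2]
                if tok_type != tokenize.STRING:
                    continue
                if tok_str[0] in 'uU':  # unicode literal: fine
                    continue
                for ch in tok_str:
                    if ord(ch) > 127:
                        print "%s:%d: non-ASCII byte in 8-bit literal" \
                              % (filename, start[0])
                        break
        finally:
            f.close()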

> I personally would prefer a command line switch over a per-file
> declaration, but that would be the subject of Hallvard's PEP.
> Under no circumstances would I disallow using the full source
> encoding in byte strings, even if the source encoding is UTF-8.

I assume here you intended to mean strings, not literals. If
so, we're in agreement: I see absolutely no reason to even
think of suggesting a change to Python's run-time string
handling behavior.

> > There's no reason why you have to have a utf-8
> > encoding declaration. If you want your source to
> > be utf-8, you need to accept the consequences.
>
> Even for UTF-8, you need an encoding declaration (although
> the UTF-8 signature is sufficient for that matter). If
> there is no encoding declaration whatsoever, Python will
> assume that the source is us-ascii.

I think I didn't say this clearly. What I intended to get across
is that there isn't any major reason for a source to be UTF-8;
other encodings are for the most part satisfactory.
Saying something about the declaration seems to have muddied
the meaning.
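
(For reference, the declaration being discussed is just a
comment in the first or second line of the file; PEP 263
recognizes anything matching coding[:=]\s*([-\w.]+), and for
UTF-8 only, the byte-order-mark signature at the start of the
file serves the same purpose:)

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Equivalent for UTF-8 only: start the file with the three-byte
    # UTF-8 signature \xef\xbb\xbf instead of a coding comment.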

The last sentence puzzles me. In 2.3, absent a declaration
(and absent a parameter on the interpreter), Python assumes
that the source is Latin-1, and phase 2 was to change
this to the 7-bit ASCII subset (US-ASCII). That was the
original question at the start of this thread. I had assumed
that change was to go into 2.4; your reply made it seem
that it would go into 2.5 (maybe). This statement makes
it seem that it is the current state in 2.3.

> > I fully expect Python to support the usual mixture
> > of encodings until 3.0 at least. At that point, everything
> > gets to be rewritten anyway.
>
> I very much doubt that, in two ways:
> a) Python 3.0 will not happen, in any foreseeable future

I probably should let this sleeping dog lie; however,
there is a general expectation that there will be a 3.0
at some point before the heat death of the universe.
I was certainly under that impression, and until this
statement I had seen nothing from anyone I regard as
authoritative that says otherwise.

> b) if it happens, much code will stay the same, or only
>     require minor changes. I doubt that non-UTF-8 source
>     encoding will be banned in Python 3.
>
> > Were you able to write your entire program in UTF-8?
> > I think not.
>
> What do you mean, your entire program? All strings?
> Certainly you were. Why not?
>
> Of course, before UTF-8 was an RFC, there were no
> editors available, nor would any operating system
> support output in UTF-8, so you would need to
> organize everything on your own (perhaps it was
> simpler on Plan-9 at that time, but I have never
> really used Plan-9 - and you might have needed
> UTF-1 instead, anyway).

This doesn't make sense in context. I'm not talking
about some misty general UTF-8. I'm talking
about writing Python programs using the CPython
interpreter. Not Jython, not IronPython, not some
other programming language.
Specifically, what would the Python 2.2 interpreter
have done if I handed it a program encoded in UTF-8?
Was that a legitimate encoding? I don't know whether
it was or not. Clearly it wouldn't have been possible
before the Unicode support in 2.0.

John Roth

>
> Regards,
> Martin




