Further changes to source encodings (Was: PEP 263 status check)

Sat Aug 7 07:26:58 EDT 2004

"Martin v. Löwis" <martin at v.loewis.de> wrote in message
news:411462b5$0$27020$9b622d9e at news.freenet.de...
> John Roth wrote:
> > I don't believe I ever said that PEP 263 said there was
> > a difference. If I gave you that impression, I will
> > apologize if you can show me where I did it.
>
> In <10h5hgvpafm8a64 at news.supernews.com>, titled
> "PEP 263 status check", you write
>
> [quote]
> My specific question there was how the code handles the
> combination of UTF-8 as the encoding and a non-ascii
> character in an 8-bit string literal. Is this an error?
> [end quote]
>
> So I assumed you were all the time talking about how this
> is implemented, and how you expected it to be implemented,
> and I assumed we agree that the implementation should
> match the specification in PEP 263.

Ah! My assumption, on the other hand, was that the code
had been implemented correctly according to the specification,
and that the specification leaves a trap for the unwary
in one very significant (although also very narrow) case.

> > As far as I'm concerned, what PEP 263 says is utterly
> > irrelevant to the point I'm trying to make.
>
> Then I don't know what the point is you are trying to
> make. It appears that you are now saying that Python
> does not work the way it should work. IOW, you are
> proposing that it be changed, right? This sounds like
> another PEP.

It could very well be another PEP.

>
> > 8-bit strings have a builtin assumption that one
> > byte equals one character.
>
> Not at all. Some 8-bit strings don't denote characters
> at all, and some 8-bit strings, at least in some regions
> of the world, are deliberately using multi-byte character
> encodings. In particular, UTF-8 is such an encoding.

This is true, but it's also beside the point. Most *programmers*
(other than ones that use single-language multi-byte
encodings) make that assumption. If they didn't, there
wouldn't be a problem.

Every tutorial I've ever seen on Unicode spends a great
deal of time at the beginning explaining the difference
between bytes, characters, encodings and all that stuff.
If this were common knowledge, why would the authors
bother? They bother simply because it isn't common
knowledge, at least in the sense that it's wired into
developers' common coding intuitions and habits.
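
To make the gap between bytes and characters concrete, here is a
minimal sketch. It is written in present-day Python 3 syntax, which
is an assumption on my part (the thread concerns Python 2.x); the
`bytes` type stands in for the 8-bit string, and the non-ASCII
characters are written as escapes so the example does not depend on
the encoding of the file it is pasted into.

```python
# Python 3 sketch; `bytes` plays the role of the 8-bit string.
text = "Gr\u00fc\u00df Gott"        # "Gruess Gott" with u-umlaut and sharp s: 9 characters

as_utf8 = text.encode("utf-8")      # u-umlaut and sharp s take two bytes each
as_latin1 = text.encode("latin-1")  # one byte per character

char_count = len(text)
utf8_byte_count = len(as_utf8)
latin1_byte_count = len(as_latin1)
print(char_count, utf8_byte_count, latin1_byte_count)  # 9 11 9

# Slicing by byte position can cut a multi-byte character in half.
truncated = as_utf8[:3]             # b"Gr" plus the first byte of u-umlaut
try:
    truncated.decode("utf-8")
    truncation_ok = True
except UnicodeDecodeError:
    truncation_ok = False
print(truncation_ok)                # False
```

With a single-byte encoding, the "one byte equals one character"
habit happens to give the right answer; with UTF-8 it does not.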

> > The problem I have is that if you use utf-8 as the
> > source encoding, you can suddenly drop multi-byte
> > characters into an 8-bit string ***BY ACCIDENT***.

> Ok.

> > Now, my suggested solution of this problem was
> > to require that 8-bit string literals in source that was
> > encoded with UTF-8 be restricted to the 7-bit
> > ascii subset.
>
> Ok. I disagree that this is desirable; if you really
> want to see that happen, you should write a PEP.
>
> > The second possibility begs the question of what
> > encoding to use, which is why I don't seriously
> > propose it (although if I understand Hallvard's
> > position correctly, that's essentially his proposal.)
>
> No. He proposes your third alternative (ban non-ASCII
> characters in byte string literals), not just for UTF-8,
> but for all encodings. Not for all files, though, but
> only for selected files.

Which is what I don't like about it. It adds complexity
to the language and a feature that I don't think is really
necessary (restricting string literals for single-byte encodings.)
The other thing I don't like is that it still leaves the
trap for the unwary which I'm discussing.
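
For comparison, the restriction I suggested (7-bit ASCII only in
8-bit string literals) could be sketched as the check below. This is
an illustration in Python 3 syntax, not anything taken from the
interpreter; the function names `is_ascii_only` and
`check_byte_string_literal` are hypothetical, and the input is
assumed to be the already-decoded text of a literal.

```python
def is_ascii_only(literal_text):
    """Return True if every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in literal_text)


def check_byte_string_literal(literal_text):
    """Hypothetical compile-time check for an 8-bit string literal.

    Returns the literal as bytes if it is pure ASCII, otherwise raises
    ValueError naming the offending characters.
    """
    if not is_ascii_only(literal_text):
        bad = [ch for ch in literal_text if ord(ch) >= 128]
        raise ValueError(
            "non-ASCII characters in 8-bit string literal: %r" % (bad,)
        )
    return literal_text.encode("ascii")


accepted = check_byte_string_literal("Hello")

try:
    check_byte_string_literal("Gr\u00fc\u00df Gott")
    rejected = False
except ValueError:
    rejected = True
```

Under such a rule, the accidental case becomes an error at compile
time instead of a silent multi-byte sequence in the string.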

> >>If
> >>there is no encoding declaration whatsoever, Python will
> >>assume that the source is us-ascii.
> [...]
> > The last sentence puzzles me. In 2.3, absent a declaration
> > (and absent a parameter on the interpreter) Python assumes
> > that the source is Latin-1, and phase 2 was to change
> > this to the 7-bit ascii subset (US-Ascii). That was the
> > original question at the start of this thread. I had assumed
> > that change was to go into 2.4, your reply made it seem
> > that it would go into 2.5 (maybe.) This statement makes
> > it seem that it is the current state in 2.3.
>
> With "will assume", I actually meant future tense. Not
> being a native speaker, I'm uncertain how to distinguish
> this from the conditional form that you apparently understood.

Ah. I understand now. I understood the final clause as a
form of present tense. To make it a future I'd probably
stick the word 'eventually' or 'in Release 2.5' in there:
"will eventually assume" or "In Release 2.5, Python will assume..."

> > Specifically, what would the Python 2.2 interpreter
> > have done if I handed it a program encoded in utf-8?
> > Was that a legitimate encoding?
>
> Yes, the Python interpreter would have processed it.
>
> print "Grüß Gott"
>
> would have sent the greeting to the terminal.

I see your point here. It does round trip successfully.
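
A rough sketch of that round trip, again in Python 3 syntax as a
stand-in (an assumption; the case discussed is Python 2.2). It
assumes the interpreter passes the bytes of the literal through
unchanged, as described above, and that the terminal decodes what it
receives with its own encoding.

```python
# Bytes as a UTF-8 editor would have saved the literal in the source file.
source_bytes = "Gr\u00fc\u00df Gott".encode("utf-8")

# Assumed behaviour: the 8-bit string holds exactly those bytes,
# and print writes them to the terminal untouched.
emitted = source_bytes

# What the user sees depends on the terminal's encoding, not on Python.
on_utf8_terminal = emitted.decode("utf-8")      # the original greeting
on_latin1_terminal = emitted.decode("latin-1")  # two odd characters per non-ASCII letter

round_trips = on_utf8_terminal == "Gr\u00fc\u00df Gott"
print(round_trips)                  # True
print(len(on_latin1_terminal))      # 11
```

So the round trip succeeds only when the terminal's encoding matches
the editor's; nothing in the interpreter checks that.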

John Roth
>
> Regards,
> Martin