[Python-3000] PEP 3131 roundup

Wed Jun 6 03:47:40 CEST 2007

On 6/5/07, Ka-Ping Yee <python at zesty.ca> wrote:
> > > G.  Should source code be required to be in normalized
> > > form?
...
> To your earlier question of "what about non-UTF-8 files", I
> imagine that the normalization restriction would apply to the
> decoded characters.  That is, once you know the source code
> encoding, there's a one-to-one mapping between the
> sequence of bytes in the source file and the sequence of
> characters to be parsed.

One of the Unicode goals is that a given sequence of bytes in the
source encoding will round-trip through a corresponding sequence of
Unicode characters.  But that corresponding sequence will not always
be in normalized form; normalizing it may prevent an (unchanged)
round-trip.  Even if authors can produce the "correct" form, it may
not be as easy to type.  If someone's keyboard easily produces the
"wrong" form, I don't want to give them syntax errors for something
that can be automatically corrected.
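
To make the round-trip concern concrete, here is a minimal sketch using
the standard unicodedata module.  The sample text and the choice of
UTF-8 as the source encoding are assumptions made for illustration only;
the same effect can arise with any encoding whose decoded text is not
already in NFC.

```python
import unicodedata

# Hypothetical source bytes: "cafe" followed by U+0301 COMBINING ACUTE
# ACCENT, as an input method might emit them (decomposed form).
source_bytes = "cafe\u0301".encode("utf-8")

decoded = source_bytes.decode("utf-8")
normalized = unicodedata.normalize("NFC", decoded)

# Decoding and re-encoding without normalizing reproduces the bytes.
round_trip_plain = decoded.encode("utf-8")

# Normalizing in between yields different bytes, so the round-trip
# is no longer unchanged.
round_trip_normalized = normalized.encode("utf-8")

print(round_trip_plain == source_bytes)
print(round_trip_normalized == source_bytes)
```

Rejecting the first byte sequence as a syntax error would punish the
author for something the parser could fix on its own.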

> Thus, two references to the same identifier will be
> represented by exactly the same bytes in the source
> file (you can't have different byte sequences in the source
> file alias to the same identifier).

The bytes -- and possibly even the original character -- can still be
different between different files (with different encodings), even if
they reference the same (imported) identifier.  I think (limited,
source) aliasing is something we just have to accept with Unicode.  I
believe the best we can do is to say:

    Python will normalize, so if two identifiers are
    canonically equivalent, you won't get any rare
    impossible-to-debug inequality showing as an
    AttributeError.
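
As a rough illustration of that guarantee, the sketch below assumes a
CPython 3.x interpreter, which (as far as I know) normalizes identifiers
to NFKC while parsing.  The identifier and the use of exec/eval to stand
in for two separately typed source files are my own assumptions, not
part of the proposal.

```python
composed = "caf\u00e9"      # single code point U+00E9
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

namespace = {}

# Bind the name using the decomposed spelling, as one file might.
exec(decomposed + " = 42", namespace)

# Look it up using the composed spelling, as another file might.
value = eval(composed, namespace)

print(value)
print(composed in namespace, decomposed in namespace)
```

Both spellings refer to one binding, and the namespace only ever holds
the normalized key.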

Ideally, that "canonical equivalence" would extend to strings (or at
least normalization would be applied automatically before hashing).
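
A small sketch of why strings matter here: attribute names passed as
strings are not normalized, so the hard-to-debug AttributeError can
still appear through getattr.  The Menu class and the nfc helper are
hypothetical names introduced only for this example.

```python
import unicodedata


class Menu:
    pass


menu = Menu()
setattr(menu, "caf\u00e9", "open")   # stored under the composed spelling

decomposed = "cafe\u0301"            # looks identical when displayed

# The raw string is a different key, so the lookup fails.
try:
    getattr(menu, decomposed)
    found_raw = True
except AttributeError:
    found_raw = False


def nfc(text):
    """Hypothetical helper: normalize to NFC before using text as a key."""
    return unicodedata.normalize("NFC", text)


# Normalizing first makes the canonically equivalent spelling match.
found_normalized = getattr(menu, nfc(decomposed)) == "open"

print(found_raw, found_normalized)
```

Normalizing before hashing would make the helper unnecessary, at the
cost of doing that work on every string used as a key.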

Ideally, either that equivalence would also include compatibility
equivalence, or else characters whose compatibility and canonical
equivalents are different would be banned for use in identifiers.
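
One way to find the characters in question is to compare the NFC and
NFKC forms.  This is only a sketch under my own assumptions: the
function name and the sample characters are illustrative, and a real
rule would have to be stated against the Unicode data files rather than
a handful of samples.

```python
import unicodedata


def differs_under_compatibility(ch):
    """Hypothetical check: True if NFKC changes ch beyond what NFC does."""
    nfc_form = unicodedata.normalize("NFC", ch)
    nfkc_form = unicodedata.normalize("NFKC", ch)
    return nfc_form != nfkc_form


samples = {
    "\u00e9": "LATIN SMALL LETTER E WITH ACUTE",
    "\ufb01": "LATIN SMALL LIGATURE FI",
    "\u00aa": "FEMININE ORDINAL INDICATOR",
}

# Characters a stricter rule would either fold or ban in identifiers.
flagged = [name for ch, name in samples.items()
           if differs_under_compatibility(ch)]

print(flagged)
```

The ligature and the ordinal indicator are flagged because their
compatibility forms ("fi" and "a") differ from their canonical forms;
the accented letter is not.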

-jJ