[Python-3000] Support for PEP 3131

Fri May 25 19:47:44 CEST 2007

On 5/25/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Jim Jewett writes:

>  > Ideally, it would even be explicit per extra character allowed, though
>  > there should obviously be shortcuts to accept entire scripts.

> How about a regexp character class as starting point?

I'm not sure I understand.  Do you mean that part of localization
should be defining what certain regular expressions should match?
That sounds great from a consistency standpoint, but it would
certainly limit who could create their own reliable tailorings.

>  > So how about

>  > [ ASCII, plus chars in a named table]

> You can specify any character you want, but if it's ASCII, or not in
> the classes PEP 3131 ends up using to define the maximal set, it gets
> deleted from the extension table (ASCII has its own table,
> conceptually).  This permits whole scripts, blocks, or ranges to be
> included.

So long as we allow tailoring, I think the maximal set should be
generous -- and I don't see any reason to pre-exclude anything outside
ASCII.
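
A minimal sketch of the "ASCII, plus chars in a named table" idea,
assuming a tailoring is just a set of extra characters filtered against
a maximal set. Nothing here comes from PEP 3131: the names
build_extension_table, generous_maximal_set and is_allowed_identifier
are hypothetical, and the default maximal set (anything outside ASCII)
is only my assumption about what "generous" could mean.

```python
import string

# ASCII identifier characters are always allowed and live in their own table.
ASCII_IDENT_CHARS = frozenset(string.ascii_letters + string.digits + "_")


def generous_maximal_set(ch):
    """Hypothetical maximal set: any non-ASCII character is eligible."""
    return ord(ch) > 127


def build_extension_table(requested, in_maximal_set=generous_maximal_set):
    """Keep only requested characters that are non-ASCII and in the maximal set."""
    return frozenset(
        ch for ch in requested
        if ord(ch) > 127 and in_maximal_set(ch)
    )


def is_allowed_identifier(name, extension_table):
    """Accept a name made only of ASCII identifier characters plus the table."""
    if not name or name[0] in string.digits:
        return False
    return all(ch in ASCII_IDENT_CHARS or ch in extension_table for ch in name)


# A tailoring that admits the Greek lowercase letters U+03B1..U+03C9.
greek_table = build_extension_table(chr(cp) for cp in range(0x03B1, 0x03CA))
```

A stricter community could pass its own in_maximal_set predicate
without changing the rest of the check.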

There are people who like to use names like "Program Files" or
"Summary of Results.Apr-3-2007 version 2.xls"; I expect the same will
be true of identifiers.  So long as the punctuation is not ASCII, we
might as well let them.  (Internally, I expect some communities to say
"that is a bad idea" about certain characters, but *I* don't want to
prejudge which characters those will be.)

>  > If you want to include punctuation or

> Why waste the effort of the Unicode technical committees?

The other committees say to exclude certain scripts, like Linear B and
Ogham.  And not to allow mixed scripts, at least if they're
confusable.  But I really don't want to explain why someone using
Cyrillic can't use certain (apparently to him) randomly determined
identifiers just because they could be confused with ASCII (or
Armenian).
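
To make the mixed-script rule concrete, here is a rough sketch of what
such a check might look like. It is not the committees' algorithm:
Python's unicodedata module exposes no script property, so the first
word of each character name stands in for the script, which is only a
heuristic. The function names are hypothetical.

```python
import unicodedata


def rough_scripts(identifier):
    """Approximate the scripts in use from the first word of each letter's name."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            # unicodedata has no script property; the name prefix is a heuristic.
            name = unicodedata.name(ch, "UNKNOWN")
            scripts.add(name.split()[0])
    return scripts


def is_mixed_script(identifier):
    """True when letters from more than one (approximate) script appear."""
    return len(rough_scripts(identifier)) > 1
```

Under this check, an identifier such as "p\u0430ypal" (with a Cyrillic
"a") is flagged as mixed, while an all-Cyrillic name with ASCII digits
and underscores is not.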

The only set the committees always recommend allowing is ASCII; beyond
that a nest of decisions (and exceptions) is almost unavoidable,
because the committees disagree among themselves.  Since we can't be
completely safe, I would rather err on the side of leniency towards
those concerned enough to make explicit decisions.

>  > undefined characters, so be it.

> -1

> Assuming undefined == reserved for future standardization that
> violates the Unicode standard.

If Unicode comes out with a new revision, the new characters should
probably be allowed; I don't want a situation where users of Cham or
Lepcha[1] are told they have to wait another year because their
scripts weren't formally adopted into Unicode until after Python 3.4.0
was already released.
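
For what it's worth, the reason a release date matters is that an
interpreter only knows the Unicode database it was built with. A small
sketch, assuming "undefined" means general category Cn (unassigned) in
that bundled database; is_assigned is a hypothetical helper name:

```python
import unicodedata


def is_assigned(ch):
    """True if this interpreter's Unicode database assigns the code point."""
    # Category "Cn" means "not assigned" in the bundled database version.
    return unicodedata.category(ch) != "Cn"


# The database version is fixed when the interpreter is built.
print(unicodedata.unidata_version)
```

A character added in a later Unicode revision reports Cn until the
interpreter ships a newer database, so a rule that rejects Cn outright
ties newly encoded scripts to the release schedule.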

[1]  http://www.unicode.org/onlinedat/languages-scripts.html says that
these languages have their own scripts (and no alternate script), and
that these scripts have not yet been encoded in Unicode.  I won't be
surprised to see Klingon identifiers before we see either of those,
but ... I don't want to contribute to their exclusion.

-jJ