[Python-3000] Support for PEP 3131

Sat May 26 09:42:57 CEST 2007

Jim Jewett writes:

 > > How about a regexp character class as starting point?
 > 
 > I'm not sure I understand.  Do you mean that part of localization
 > should be defining what certain regular expressions should match?

No, I meant simply a list of character ranges, as characters.  The
definition of "safe ASCII" would be something like

    r"\t\r\n -~"

Your table format is better.  If people want to put the actual
characters in comments (maybe in source files to be preprocessed
before installation), let them.
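
To make the comparison concrete, here is a rough sketch of the same
"safe ASCII" set written both ways.  The names SAFE_ASCII_CLASS and
SAFE_ASCII_TABLE, and the (first, last) tuple layout, are my own
illustration and not a proposed format.

```python
import re

# The range string from above: tab, CR, LF, and space through tilde.
SAFE_ASCII_CLASS = r"\t\r\n -~"

# Hypothetical table form of the same set: inclusive (first, last) code points.
SAFE_ASCII_TABLE = [(0x09, 0x0A), (0x0D, 0x0D), (0x20, 0x7E)]

_SAFE_RE = re.compile("[" + SAFE_ASCII_CLASS + "]*")


def safe_by_class(text):
    """True if every character of text falls in the regexp character class."""
    return _SAFE_RE.fullmatch(text) is not None


def safe_by_table(text):
    """True if every character of text falls in some range of the table."""
    return all(
        any(lo <= ord(ch) <= hi for lo, hi in SAFE_ASCII_TABLE)
        for ch in text
    )
```

Both functions accept the same strings; the table is just easier to
diff, tailor, and generate mechanically.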

 > So long as we allow tailoring, I think the maximal set should be
 > generous -- and I don't see any reason to pre-exclude anything outside
 > ASCII.

Cf characters?  Are we admitting "stupid bidi tricks", too?<wink>

But I'll tell you what my reason is: we want to be in a position to
avoid prohibiting previously acceptable characters wherever possible.

 > There are people who like to use names like "Program Files" or
 > "Summary of Results.Apr-3-2007 version 2.xls"; I expect the same will
 > be true of identifiers.  So long as the punctuation is not ASCII, we
 > might as well let them.

Why not let them use ASCII punctuation, as long as it's not Python
syntax?

I.e., for one thing, we might want to do something with that punctuation
some day.  For example, I could imagine using guillemets to denote
raw strings or to substitute for triple quotes.  Local parsing (as done
by program editors) would be easier with directed quotes.  Etc.  For
reasons of visual distinctiveness, we might choose to use Chinese or
Arabic versions.

 > The other committees say to exclude certain scripts, like Linear B and
 > Ogham.  And not to allow mixed scripts, at least if they're
 > confusable.  But I really don't want to explain why someone using
 > Cyrillic can't use certain (apparently to him) randomly determined
 > identifiers just because it could be confused with ASCII (or
 > Armenian).

-1 on restrictions according to confusability or the block.  That's a
matter for personal judgement, and there are cheap technical solutions
for those who want to use confusable Cyrillic or Linear B and still
avoid confusion.  I think those restrictions are an idea that must be
available (perhaps as a table we distribute), but I think they'll turn
out to suck pretty badly.
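
As one example of a cheap technical solution, a lint-style check can
flag identifiers that mix scripts without the language forbidding them.
This is only a sketch: Python's unicodedata module has no script
property, so the first word of the character name stands in for it,
which is an assumption that will misfire on some characters.  The
function names are hypothetical.

```python
import unicodedata


def rough_scripts(identifier):
    """Return the set of script-like name prefixes used by the letters."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            # The first word of the character name (e.g. "LATIN",
            # "CYRILLIC") is only a rough stand-in for the script.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed(identifier):
    """True if the letters of identifier appear to come from several scripts."""
    return len(rough_scripts(identifier)) > 1
```

For instance, looks_mixed("p\u0430ypal") is true because U+0430 is a
Cyrillic letter, while an all-Cyrillic or all-Latin name passes.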

 > If unicode comes out with a new revision, the new characters should
 > probably be allowed; I don't want a situation where users of Cham or
 > Lepcha[1] are told they have to wait another year because their
 > scripts weren't formally adopted into unicode until after python 3.4.0
 > was already released.

Tough call.  I'd say, let's cross that bridge when we come to it.

In any case there will have to be some mechanism to access a Unicode
database at either build time or run time.  Let them munge that
database if they're in a hurry.
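
For the run-time case, today's unicodedata module already shows roughly
what such access looks like.  The sketch below only reports what the
running interpreter's database knows; the helper names are mine, and it
says nothing about how a tailored database would be installed.

```python
import unicodedata


def database_version():
    """Version string of the Unicode database this interpreter was built with."""
    return unicodedata.unidata_version


def is_assigned(ch):
    """True if the database has a real category for ch.

    Category "Cn" covers both unassigned code points and noncharacters,
    so a character from a script added in a later Unicode revision
    reports False here until the database is updated.
    """
    return unicodedata.category(ch) != "Cn"
```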

Maybe the way to handle this is to allow private-use characters in
identifiers as an option.  That would be doable with your well-known
file scheme.  But it's very dangerous across modules.
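
A minimal sketch of what such an option might look like, assuming a
hypothetical allow_private_use switch (nothing like it exists in
Python).  Private-use code points have general category "Co".

```python
import unicodedata


def is_private_use(ch):
    """True if ch is a private-use code point (general category Co)."""
    return unicodedata.category(ch) == "Co"


def identifier_ok(name, allow_private_use=False):
    """Hypothetical check: ordinary identifier rules, optionally
    tolerating private-use characters."""
    if allow_private_use:
        # Substitute a plain letter for each private-use character, then
        # apply the interpreter's ordinary identifier rules.
        name = "".join("x" if is_private_use(ch) else ch for ch in name)
    return name.isidentifier()
```

The cross-module danger shows up immediately: nothing in the check can
tell whether two modules mean the same thing by U+E000.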

By the way, this is what the Japanese call the "gaiji" ("outside
character") problem.  It's a very tough nut to crack; the Japanese
never did.