[Python-3000] Support for PEP 3131

Jim Jewett jimjjewett at gmail.com
Fri May 25 18:03:59 CEST 2007


On 5/25/07, Adam Olsen <rhamph at gmail.com> wrote:
> On 5/23/07, Jim Jewett <jimjjewett at gmail.com> wrote:

> > > ... range of characters and languages allowed ...

> > Fair enough -- but the problem is that this isn't a solved issue
> > yet; the unicode group themselves make several contradictory
> > recommendations.

> > I can come up with rules that are probably just about right, but I
> > will make mistakes (just as the unicode consortium itself did,
> > which is why they have both ID and XID, and why both have
> > stability characters).  Even having read their reports, my initial
> > rules would still have banned mixed-script, which would have
> > prevented your edict-example.

> If we allowed an underscore as a mixed-script separator
> (allowing "def get_原料(self):"), does this let us get away
> with otherwise banning mixed-scripts?

I wondered that, until seeing that it wouldn't really solve the
problem anyhow.  It is possible to write entire words (such as "allow"
or "scope") in multiple scripts.  (Unicode calls these "whole script
confusables".)  You can't stop that without banning one of the scripts
entirely, which would disenfranchise users of some languages.
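
To make the whole-script case concrete, here is a rough Python sketch.
It assumes that the first word of a character's Unicode name is a good
enough stand-in for its script (as far as I know, the standard
unicodedata module does not expose the real Script property), and
script_of / scripts_in are hypothetical helper names, not anything
proposed in the PEP.

```python
import unicodedata


def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name.

    This is an assumption for illustration only; it is not the real
    Unicode Script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_in(identifier):
    """Return the set of approximate scripts used by the letters of an identifier."""
    return {
        script_of(ch)
        for ch in identifier
        if unicodedata.category(ch).startswith("L")  # skip "_" and digits
    }


latin_scope = "scope"
# U+0455 U+0441 U+043E U+0440 U+0435: Cyrillic letters that resemble "scope"
cyrillic_scope = "\u0455\u0441\u043e\u0440\u0435"

print(latin_scope == cyrillic_scope)  # False
print(scripts_in(latin_scope))        # {'LATIN'}
print(scripts_in(cyrillic_scope))     # {'CYRILLIC'}
```

Each identifier is single-script, so a rule that only bans mixing
scripts within a word (or between underscores) accepts both.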

So I think the least-bad solution is to say "OK, we won't allow these
potentially confusable characters unless you were expecting them."

And once we have a way to say "I'm expecting Cyrillic", we might as
well let the user specify exactly what they're expecting, and make
their own decisions on what is likely to be needed vs likely to be
confused.
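
A minimal sketch of what "say what you're expecting" could look like,
under stated assumptions: scripts are approximated by the first word
of each character's Unicode name, and unexpected_scripts and its
"expected" argument are hypothetical names for illustration, not an
actual or proposed Python API.

```python
import unicodedata


def script_of(ch):
    """Approximate script: first word of the Unicode character name (an assumption)."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def unexpected_scripts(identifier, expected):
    """Return the approximate scripts in identifier that the user did not declare."""
    used = {
        script_of(ch)
        for ch in identifier
        if unicodedata.category(ch).startswith("L")  # ignore "_" and digits
    }
    return used - set(expected)


name = "get_\u539f\u6599"  # the "get_原料" example from the quoted message

print(unexpected_scripts(name, {"LATIN"}))         # {'CJK'}
print(unexpected_scripts(name, {"LATIN", "CJK"}))  # set()

# An all-Cyrillic look-alike of "scope" is flagged unless Cyrillic was expected.
lookalike = "\u0455\u0441\u043e\u0440\u0435"
print(unexpected_scripts(lookalike, {"LATIN"}))    # {'CYRILLIC'}
```

The point of the design is that the check is driven by what the user
declared, rather than by a fixed list of banned scripts.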

For more information, see section 4 of

    http://www.unicode.org/reports/tr39/

and current likely problem characters at

    http://www.unicode.org/reports/tr39/data/confusables.txt
    http://www.unicode.org/reports/tr39/data/confusablesWholeScript.txt

-jJ

