[Python-3000] Support for PEP 3131

Adam Olsen rhamph at gmail.com
Sat May 26 01:29:34 CEST 2007


On 5/25/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 5/25/07, Adam Olsen <rhamph at gmail.com> wrote:
> > On 5/25/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > > On 5/25/07, Adam Olsen <rhamph at gmail.com> wrote:
> > > > If we allowed an underscore as a mixed-script separator
> > > > (allowing "def get_原料(self):"), does this let us get away
> > > > with otherwise banning mixed-scripts?
>
> ...
>
> > Indeed, the whole-script confusables do create significant
> > holes, but I think the best solution is still to ban mixed-scripts
> > and accept that it's only a "75% solution".  Using an "I'm
> > expecting Cyrillic" flag makes it harder for those who need
> > Cyrillic AND still leaves them vulnerable to the same problem
> > we're trying to protect ourselves from.
>
> hmm... I had thought they should either not include the confusable
> letters, or use different fonts -- whatever they normally do.

I don't understand.  Are you suggesting that those typing in Russian
or Ukrainian should switch from Cyrillic to Latin when typing an 'a'?
Surely I misunderstand.

As for how likely accidental confusion is: to get some statistics, I
installed a Ukrainian wordlist and grepped it for words that contained
only characters resembling lowercase Latin characters (in my font).
Of 990736 entries, only 133 matched.  Of those, only one looked like
an English word: a lone 'i'.  I'm tempted to suggest special-casing
it, but if that's the worst problem in all of this I think it can wait
until it's proven to be a problem.
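
For anyone who wants to repeat the wordlist check, here is a rough
sketch in Python.  The LOOKALIKES table is an assumption: the set I
actually matched against depended on my font, so treat these eight
letters as illustrative only.  The function names are made up for this
example.

```python
# Hypothetical table of Cyrillic lowercase letters that resemble
# lowercase Latin letters in many fonts.  This is an assumed set, not
# the one used for the numbers quoted above.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def latin_lookalike(word):
    """Return the Latin spelling a word could be mistaken for, or None
    if any character has no Latin lookalike in the table."""
    if not word or any(ch not in LOOKALIKES for ch in word):
        return None
    return "".join(LOOKALIKES[ch] for ch in word)


def confusable_words(words):
    """Return (word, latin_spelling) pairs for the fully confusable words."""
    pairs = []
    for word in words:
        latin = latin_lookalike(word)
        if latin is not None:
            pairs.append((word, latin))
    return pairs
```

Feeding it the lines of a wordlist gives the candidates; deciding
which of them look like English words still has to be done by eye or
against an English wordlist.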


> But I suppose using an _ separator could still be a useful crutch.
> Whether it is useful enough ... I'll let others chime in.

Using _ as a separator is only intended to allow fixed prefixes (or
suffixes) for arbitrary names[1].  I don't see how this becomes a
crutch.


[1] urllib2 uses this style, although it's unlikely to ever have
non-ASCII names.  Still, I don't think we should limit the style.
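
To make the underscore rule concrete, here is a minimal sketch of the
check I have in mind.  It is not a proposed implementation: Python's
standard library has no script property, so script_of() approximates
the script by the first word of the Unicode character name, which is
an assumption of this example.  Both function names are hypothetical.

```python
import unicodedata


def script_of(ch):
    """Crude script guess: the first word of the Unicode character name.
    Digits are treated as script-neutral and return None."""
    if ch.isdigit():
        return None
    return unicodedata.name(ch).split()[0]


def is_allowed(identifier):
    """Allow an identifier only if each underscore-separated part
    draws its letters from a single script."""
    for part in identifier.split("_"):
        scripts = {script_of(ch) for ch in part}
        scripts.discard(None)
        if len(scripts) > 1:
            return False
    return True
```

With this, "get_原料" passes because each part is single-script, while
"get原料" and a Latin name containing one Cyrillic 'а' are rejected.
The name-based guess is too coarse for real use (it would, for
example, treat hiragana and CJK ideographs as different scripts), so a
real check would need proper script data.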

> > A more extreme solution would be to introduce a symbol type that
> > converts whole-script confusables to a canonical
> > form
>
> The unicode consortium recommends against this.  I'm not sure if it is
> just a presentation issue, or concerns about compatibility; the
> "confusables" lists are explicitly allowed to change.

Having the equivalences change between Python versions (assuming at
least this aspect is hardcoded) would be quite troublesome.  Perhaps
even more so than the confusion it's intended to prevent!

-- 
Adam Olsen, aka Rhamphoryncus

