[Python-3000] Unicode identifiers (Was: sets in P3K?)

Guido van Rossum guido at python.org
Sat Apr 29 18:40:06 CEST 2006


On 4/28/06, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Guido van Rossum wrote:
> >> The distinction of letters and digits is also straight-forward:
> >> a digit is ASCII [0-9]; it's a separate lexical class only
> >> because it plays a special role in (number) literals. More
> >> generally, there is the distinction of starter and non-starter
> >> characters.
> >
> > But Unicode has many alternative sets of digits for which "isdigit" is true.
>
> You mean, the Python isdigit() method? Sure, but the tokenizer uses
> the C isdigit function, which gives true only for [0-9].

Isn't that because it's only defined on 8-bit characters though?

And if we're talking about Unicode, why shouldn't we use the Unicode
isdigit()? After all, you were talking about the Unicode consortium's
rules for which characters can be part of identifiers.
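(A quick illustration, not from the original thread: Python's own
str.isdigit() already follows the Unicode property, while the tokenizer's
notion of "digit" is effectively ASCII-only.)

```python
# Contrast the tokenizer's effective ASCII-only notion of "digit"
# with Unicode's: str.isdigit() is true for many scripts' digits.
ASCII_DIGITS = set("0123456789")

def is_ascii_digit(ch):
    # Roughly what C's isdigit() accepts in the C locale.
    return ch in ASCII_DIGITS

for ch in ("5", "\u0663", "\u0be7"):  # '5', ARABIC-INDIC THREE, TAMIL ONE
    print(repr(ch), is_ascii_digit(ch), ch.isdigit())
```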

> FWIW, POSIX
> allows 6 alternative characters to be defined as hexdigits for
> isxdigit, so the tokenizer shouldn't really use isxdigit for
> hexadecimal literals.

I think if we're talking Unicode, POSIX is irrelevant though, right?

> So from the implementation point of view, nothing much would have
> to change: the usage of isalnum in the tokenizer is already wrong,
> as it already allows putting non-ASCII characters into identifiers,
> if the locale classifies them as alpha-numeric.

But we force the locale to be C, right? I've never heard of someone
who managed to type non-ASCII letters into identifiers, and I'm sure
it would've been reported as a bug.

> I can't see why the Unicode notion of digits should affect the
> language specification in any way. The notion of digit is only
> used to define what number literals are, and I don't propose
> to change the lexical rules for number literals - I propose
> to change the rules for identifiers.

Well, identifiers can contain digits too.
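(Today's Python 3, which later adopted Unicode identifiers through PEP
3131, shows exactly this interaction; this of course postdates the
thread. Non-ASCII digits are valid identifier continuation characters
but cannot start an identifier.)

```python
# Python 3's str.isidentifier() follows the Unicode identifier rules
# this thread is debating: Nd-category digits may continue an
# identifier but not begin one.
print("x1".isidentifier())        # ASCII digit as continuation
print("x\u0663".isidentifier())   # ARABIC-INDIC DIGIT THREE as continuation
print("\u0663x".isidentifier())   # a digit cannot start an identifier
```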

> > You can as far as the lexer is concerned because the lexer treats
> > keywords as "just" identifiers. Only the parser knows which ones are
> > really keywords.
>
> Right. But if the identifier syntax was
> [:identifier_start:][:identifier_cont:]*
> then things would work out just fine: identifier_start intersected
> with ASCII would be [A-Za-z_], and identifier_cont intersected
> with ASCII would be [A-Za-z0-9_]; this would include all keywords.
> You would still need punctuation between two subsequent
> "identifiers", and that punctuation would have to be ASCII, as
> non-ASCII characters would be restricted to comments, string
> literals, and identifiers.

OK, I trust you that it can be made to work.
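(For the record, the identifier_start/identifier_cont scheme can be
sketched in a few lines. This is a rough approximation using Unicode
general categories; the Unicode consortium's actual identifier
definition uses the XID_Start/XID_Continue properties and is more
involved.)

```python
import unicodedata

def is_identifier_start(ch):
    # Letters of any script, plus underscore; intersected with ASCII
    # this is exactly [A-Za-z_].
    return ch == "_" or unicodedata.category(ch).startswith("L")

def is_identifier_cont(ch):
    # Starters plus decimal digits (and, roughly, combining marks and
    # connector punctuation); intersected with ASCII this is [A-Za-z0-9_].
    return is_identifier_start(ch) or unicodedata.category(ch) in (
        "Nd", "Mn", "Mc", "Pc")

def is_identifier(s):
    # identifier ::= identifier_start identifier_cont*
    return bool(s) and is_identifier_start(s[0]) and all(
        is_identifier_cont(c) for c in s[1:])
```

Since [A-Za-z_][A-Za-z0-9_]* is a subset of this production, every
keyword still matches, so the lexer can keep treating keywords as plain
identifiers and let the parser sort them out.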

But regardless, I really don't like it. I expect we'd be getting tons
of questions on c.l.py about programs with identifiers containing
squiggles we can neither read nor type, and for which we may not even
have the fonts or the display capabilities (if it's right-to-left
script).

I do think that *eventually* we'll have to support this. But I don't
think Python needs to lead the pack here; I don't think the tools are
ready yet.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

