[Python-Dev] Python and the Unicode Character Database

Steven D'Aprano steve at pearwood.info
Mon Nov 29 00:43:59 CET 2010


Alexander Belopolsky wrote:
> Two recently reported issues brought into light the fact that Python
> language definition is closely tied to character properties maintained
> by the Unicode Consortium. [1,2]  For example, when Python switches to
> Unicode 6.0.0 (planned for the upcoming 3.2 release), we will gain two
> additional characters that Python can use in identifiers. [3]
[...]

Why do you consider this a problem? It would be a problem if previously 
valid identifiers *stopped* being valid, but not the other way around.
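
To check what a given interpreter accepts, something like the following sketch may help. It assumes Python 3, and the sample names are hypothetical, picked only for illustration; they are not the characters added in Unicode 6.0.0.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
print(unicodedata.unidata_version)

def valid_identifiers(names):
    """Return the subset of names that this interpreter accepts as identifiers."""
    return [name for name in names if name.isidentifier()]

# Hypothetical sample names, chosen for illustration only.
samples = ["spam", "naïve", "π", "2fast", "with space"]
print(valid_identifiers(samples))
```

Running the same list under an older and a newer interpreter would show whether any name was dropped, which is the case that would actually matter.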


> Of course, the likelihood is low that this change will affect any
> user, but the change in str.isspace() reported in [1] is likely to
> cause some trouble:

Looking at the thread here:
http://bugs.python.org/issue10567

I interpret it as indicating that Python's isspace() has been buggy for 
many years, and is only now being fixed. It's always unfortunate when 
people rely on bugs, but I'm not sure we should be promising to support 
bug-for-bug compatibility from one version to the next :)
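
For anyone who wants to see what isspace() reports alongside the underlying character properties, here is a rough sketch, assuming Python 3. The characters are my own picks for illustration and are not necessarily the ones discussed in the issue; describe_space is a hypothetical helper name.

```python
import unicodedata

def describe_space(ch):
    """Return (isspace result, general category, bidirectional class) for one character."""
    return ch.isspace(), unicodedata.category(ch), unicodedata.bidirectional(ch)

# SPACE, TAB, NO-BREAK SPACE, EM SPACE, ZERO WIDTH SPACE, and a plain letter.
for ch in [" ", "\t", "\u00a0", "\u2003", "\u200b", "x"]:
    print("U+%04X" % ord(ch), describe_space(ch))
```

Comparing this output across Python versions is one way to tell which answers changed because the database changed and which changed because a bug was fixed.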


> While we have little choice but to follow UCD in defining
> str.isidentifier(), I think Python can promise users more stability in
> what it treats as space or as a digit in its builtins.   For example,
> I don't think that supporting
> 
>>>> float('١٢٣٤.٥٦')
> 1234.56
> 
> is more important than to assure users that once their program
> accepted some text as a number, they can assume that the text is
> ASCII.

Seems like a pretty foolish assumption, if you ask me, pretty much akin 
to assuming that if string.isalpha() returns true that string is ASCII.
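
A quick sketch of why that assumption fails, assuming Python 3; is_ascii is a hypothetical helper written for this example, not a builtin.

```python
def is_ascii(text):
    """True if every character in text is in the ASCII range."""
    return all(ord(ch) < 128 for ch in text)

# Each word is printed with its isalpha() result and whether it is pure ASCII.
for word in ["hello", "héllo", "привет", "日本語"]:
    print(word, word.isalpha(), is_ascii(word))
```

All four words are alphabetic as far as isalpha() is concerned, but only the first is ASCII.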

Support for non-Arabic numerals in number strings goes back to at least 
Python 2.4:

[steve@sylar ~]$ python2.4
Python 2.4.6 (#1, Mar 30 2009, 10:08:01)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> float(u'١٢٣٤.٥٦')
1234.5599999999999


The fact that this is (apparently) only being raised now means that it 
isn't actually a problem in real life. I'd even say that it's a feature, 
and that if Python didn't support non-Arabic numerals, it should.
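
Code that really does need ASCII-only input can check for it explicitly instead of relying on float() to reject anything else. A minimal sketch, assuming Python 3; normalize_digits and float_ascii_only are hypothetical helper names, not anything in the standard library.

```python
import unicodedata

def normalize_digits(text):
    """Map every Unicode decimal digit in text to its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digit characters
        out.append(str(value) if value is not None else ch)
    return "".join(out)

def float_ascii_only(text):
    """Hypothetical strict variant: reject input containing non-ASCII characters."""
    if any(ord(ch) > 127 for ch in text):
        raise ValueError("non-ASCII input: %r" % text)
    return float(text)

# The same digits as in the example above, written as escapes.
arabic_indic = "\u0661\u0662\u0663\u0664.\u0665\u0666"
print(float(arabic_indic))
print(normalize_digits(arabic_indic))
```

With this approach the strictness is opt-in at the call site, and the builtin keeps accepting any decimal digits the Unicode database knows about.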



-- 
Steven


