[Python-Dev] Unicode character property methods

Mon, 06 Mar 2000 23:04:14 +0100

Guido van Rossum wrote:
> 
> > As you may have noticed, the Unicode objects provide
> > new methods .islower(), .isupper() and .istitle(). Finn Bock
> > mentioned that Java also provides .isdigit() and .isspace().
> >
> > Question: should Unicode also provide these character
> > property methods: .isdigit(), .isnumeric(), .isdecimal()
> > and .isspace() ? Plus maybe .digit(), .numeric() and
> > .decimal() for the corresponding decoding ?
> 
> What would be the difference between isdigit, isnumeric, isdecimal?
> I'd say don't do more than Java.  I don't understand what the
> "corresponding decoding" refers to.  What would "3".decimal() return?

These originate in the Unicode database; see

ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html

Here are the descriptions:

"""
   6
      Decimal digit value
                        normative
                                     This is a numeric field. If the
                                     character has the decimal digit
                                     property, as specified in Chapter
                                     4 of the Unicode Standard, the
                                     value of that digit is represented
                                     with an integer value in this field
   7
      Digit value
                        normative
                                     This is a numeric field. If the
                                     character represents a digit, not
                                     necessarily a decimal digit, the
                                     value is here. This covers digits
                                     which do not form decimal radix
                                     forms, such as the compatibility
                                     superscript digits
   8
      Numeric value
                        normative
                                     This is a numeric field. If the
                                     character has the numeric
                                     property, as specified in Chapter
                                     4 of the Unicode Standard, the
                                     value of that character is
                                     represented with an integer or
                                     rational number in this field. This
                                     includes fractions as, e.g., "1/5" for
                                     U+2155 VULGAR FRACTION
                                     ONE FIFTH. Also included are
                                     numerical values for compatibility
                                     characters such as circled
                                     numbers.
"""

u"3".decimal() would return 3. u"\u2155".

Some more examples from the unicodedata module (which makes
all fields of the database available in Python):

>>> unicodedata.decimal(u"3")
3
>>> unicodedata.decimal(u"²")
2
>>> unicodedata.digit(u"²")
2
>>> unicodedata.numeric(u"²")
2.0
>>> unicodedata.numeric(u"\u2155")
0.2
>>> unicodedata.numeric(u'\u215b')
0.125
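
To make the three-way distinction concrete, here is a minimal sketch in present-day Python 3 syntax (an assumption: it is not the interpreter used in the session above). The helper name describe() is hypothetical. It relies only on the optional default argument of the unicodedata lookups and on the str methods isdecimal(), isdigit() and isnumeric() that current Python provides. Results depend on the Unicode database bundled with the interpreter; with recent versions, SUPERSCRIPT TWO is reported as a digit but not as a decimal, which differs from the 3.0.0 data shown above.

```python
import unicodedata


def describe(ch):
    """Return the three numeric properties of a single character.

    Each lookup passes None as the default, so a character that lacks
    a property yields None instead of raising ValueError.
    """
    return {
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }


# DIGIT THREE, SUPERSCRIPT TWO, VULGAR FRACTION ONE FIFTH
for ch in ("3", "\u00b2", "\u2155"):
    print(repr(ch), describe(ch), ch.isdecimal(), ch.isdigit(), ch.isnumeric())
```

Returning None rather than raising keeps the comparison in one table: a character can then be read as "has a numeric value but no digit value" at a glance.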

> > Similar APIs are already available through the unicodedata
> > module, but could easily be moved to the Unicode object
> > (they cause the builtin interpreter to grow a bit in size
> > due to the new mapping tables).
> >
> > BTW, string.atoi et al. are currently not mapped to
> > string methods... should they be ?
> 
> They are mapped to int() c.s.

Hmm, I just noticed that int() et friends don't like
Unicode... shouldn't they use the "t" parser marker 
instead of requiring a string or tp_int compatible
type ?
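
For illustration only, a small sketch of how int() treats Unicode input in present-day Python 3 (an assumption: this is not the behaviour of the interpreter discussed in this thread, and it says nothing about the "t" parser marker). The helper name parse_decimal() is hypothetical. Current int() accepts characters with the decimal digit property and rejects characters that only have a digit or numeric value.

```python
def parse_decimal(text):
    """Parse text as a base-10 integer, or return None if int() rejects it."""
    try:
        return int(text)
    except ValueError:
        return None


print(parse_decimal("42"))            # ASCII digits
print(parse_decimal("\u0664\u0662"))  # ARABIC-INDIC DIGIT FOUR, DIGIT TWO
print(parse_decimal("\u00b2"))        # SUPERSCRIPT TWO, not a decimal digit
```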

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/