[Python-Dev] Python and the Unicode Character Database

Alexander Belopolsky alexander.belopolsky at gmail.com
Fri Dec 3 06:10:29 CET 2010


On Thu, Dec 2, 2010 at 4:57 PM, Mark Dickinson <dickinsm at gmail.com> wrote:
..
> (the decimal spec requires that non-European digits be accepted).

Mark,

I think *requires* is too strong a word for what the spec actually
says.  The decimal module documentation refers to two authorities:

1. IBM’s General Decimal Arithmetic Specification
2. IEEE standard 854-1987

The IEEE standard predates Unicode and, unsurprisingly, has nothing
to say on the issue.  The IBM spec says the following in the
Conversions section:

"""
It is recommended that implementations also provide additional number
formatting routines (including some which are locale-dependent), and
if available should accept non-European decimal digits in strings.
""" http://speleotrove.com/decimal/daconvs.html

This cannot possibly be interpreted as normative text.  The emphasis
is clearly on "formatting routines" with "non-European decimal digits"
added as an afterthought.  This recommendation can reasonably be
interpreted as a requirement that conversion routines should accept
what formatting routines can produce.  In Python there are no
formatting routines to produce non-European numerals, so there is no
requirement to accept them in conversions.
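
To illustrate the point, at least with the formats I tried (a quick
sketch, nothing exhaustive):

    import decimal

    # Decimal's formatting paths emit only ASCII (European) digits,
    # so accepting their output back never requires other digit systems.
    d = decimal.Decimal("1234.5")
    for text in (str(d), format(d, ".2f"), format(d, "e")):
        assert all(ord(c) < 128 for c in text), text
        print(text)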

I don't think the decimal module should support non-European decimal
digits.  The only place where it makes some sense is int(), because
there we have a fighting chance of producing a reasonable definition.
The motivating use case is conversion of numerical data extracted
from text using simple '\d+' regex matches.
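
To make the use case concrete (Python 3 semantics; the sample string
is made up):

    import re

    # In Python 3, r'\d' matches any character in Unicode category Nd,
    # so a simple scrape can hand int() a run of Arabic-Indic digits.
    text = "Total: \u0661\u0662\u0663 items"   # ARABIC-INDIC digits for 123
    digits = re.search(r"\d+", text).group()
    print(digits)        # the three Arabic-Indic digit characters
    print(int(digits))   # 123 -- int() already accepts Nd digits here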

Here is how I would do it:

1.  A string x of non-European decimal digits is accepted only by
int(x), not by int(x, 0) or int(x, 10).
2.  If x contains one or more non-European digits, then

    (a)  all digits must be from the same block:

          import unicodedata

          def basepoint(c):
              return ord(c) - unicodedata.digit(c)

          all(basepoint(c) == basepoint(x[0]) for c in x)   # must be True

    (b)  a '+' or '-' sign is not allowed.

3.  A character c is a digit if it matches the '\d' regex.  I think
this means unicodedata.category(c) == 'Nd'.
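
Putting rules 1-3 together, roughly (just a sketch of the acceptance
test, not a patch; the helper names are mine):

    import unicodedata

    def basepoint(c):
        # all digits of one contiguous Nd block share the same basepoint
        return ord(c) - unicodedata.digit(c)

    def acceptable_nonascii_digits(x):
        """Would int(x) accept x under rules 2-3, assuming x contains
        at least one non-European digit?"""
        # Rule 3 (and 2(b) for free): every character must be an Nd
        # digit, so a leading '+' or '-' rejects the whole string.
        if not x or any(unicodedata.category(c) != 'Nd' for c in x):
            return False
        # Rule 2(a): all digits must come from the same block.
        return all(basepoint(c) == basepoint(x[0]) for c in x)

    print(acceptable_nonascii_digits("\u0661\u0662\u0663"))    # True: one Arabic-Indic block
    print(acceptable_nonascii_digits("1\u0662\u0663"))         # False: mixed blocks
    print(acceptable_nonascii_digits("-\u0661\u0662\u0663"))   # False: sign not allowed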

Condition 2(b) is important because there is no clear way to define
what is acceptable as '+' or '-' using Unicode character properties,
and not all number systems even have a local form of negation.  (It
is also YAGNI.)
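
For what it's worth, the character properties of the obvious sign
candidates don't even line up (just an illustration):

    import unicodedata

    # ASCII '+' and '-' already fall into different general categories,
    # and the common lookalikes scatter across several more.
    for ch in "+-\u2212\u207b\ufe63\uff0b":
        print("U+%04X %-22s %s" % (ord(ch), unicodedata.name(ch),
                                   unicodedata.category(ch)))
    # PLUS SIGN is Sm, HYPHEN-MINUS is Pd, MINUS SIGN is Sm, and so on.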

