[Python-ideas] [Python-Dev] Unicode minus sign in numeric conversions

Sun Jun 9 03:04:54 CEST 2013

Guido van Rossum writes:

 > Is there a Unicode standard for parsing numbers?

Probably UTS #35 is most relevant.

Unicode Technical Standard #35
Unicode Locale Data Markup Language (LDML)
Part 3: Numbers
http://www.unicode.org/reports/tr35/tr35-numbers.html
(Maintained separately from "the" Unicode standard.)

Cf. http://www.unicode.org/versions/Unicode6.2.0/ch05.pdf, section
5.5, and http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf,
section 4.4.  tl;dr version: *If* you recognize a character as a
digit, it *must* be given the correct value.  Parsing numerical
components from text is inherently hairy (e.g., Roman numerals, which
exist in Unicode only as compatibility characters and therefore are
not normatively digits, and variable names such as "x2"), and so is
considered application-specific.
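
To make the digit/non-digit distinction concrete, here is a small
sketch using Python's standard unicodedata module.  The sample
characters and the helper name numeric_properties are chosen only for
illustration; they are not taken from the Unicode text.

```python
import unicodedata

# Sample characters picked for illustration only.
samples = {
    "DIGIT THREE": "3",
    "ARABIC-INDIC DIGIT THREE": "\u0663",
    "SUPERSCRIPT TWO": "\u00b2",
    "ROMAN NUMERAL EIGHT": "\u2167",
}

def numeric_properties(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

for name, ch in samples.items():
    print(name, numeric_properties(ch))

# The built-in int() accepts any character with the decimal digit property.
print(int("\u0663"))
```

The Roman numeral has a numeric value but neither a decimal nor a
digit value, so a parser that only honors the decimal digit property
will not treat it as a digit at all.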

UTS #35 recommends the "lenient" rules appended below for parsing
numerical data.  N.B. Some terms such as "'ignore' set" are defined
elsewhere in the TR.  Apparently lenient parsing is expected to cause
no problems in "honest" environments (the only exception mentioned is
security, e.g., confusables).

The parse is explicitly locale-dependent in UTS #35, so it could depend
on a subset of all possible numeric expression characters.  Python
could take the position that the only numeric formats known to Python
are those based on ASCII (but even then there are ambiguities by
locale -- cf. the 4th rule below).  I think this is probably best by
default.  Parsing numbers expressed in mixed scripts is clearly a
security risk, and doesn't seem to serve a useful purpose in numerical
data.  To do a good job of parsing numerical information from general
text, a separate library which is sensitive to the context in which
the numbers are embedded is required.
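
As a rough illustration of the two positions above (ASCII-only by
default, and mixed scripts being suspicious), here is a minimal
sketch.  The function names are hypothetical, and the mixed-script
check relies on the assumption that each script's decimal digits
occupy a contiguous run from zero to nine, so that subtracting a
digit's value from its code point identifies the run.

```python
import re
import unicodedata

# Explicit [0-9]: in Python 3, \d would also match non-ASCII decimal digits.
_ASCII_INT = re.compile(r"[+-]?[0-9]+")

def parse_int_ascii_only(text):
    """Hypothetical strict parser: ASCII sign and ASCII digits only."""
    if _ASCII_INT.fullmatch(text) is None:
        raise ValueError("not an ASCII integer literal: %r" % (text,))
    return int(text)

def digit_zero(ch):
    """Code point of the zero of ch's digit run, or None if ch is not a decimal digit."""
    value = unicodedata.decimal(ch, None)
    if value is None:
        return None
    return ord(ch) - value

def mixes_digit_scripts(text):
    """True if text contains decimal digits from more than one digit run."""
    zeros = {digit_zero(ch) for ch in text}
    zeros.discard(None)
    return len(zeros) > 1

print(parse_int_ascii_only("-42"))
print(mixes_digit_scripts("1\u0663"))       # ASCII one + Arabic-Indic three
print(mixes_digit_scripts("\u0661\u0663"))  # Arabic-Indic digits only
```

With this sketch a string that uses U+2212 MINUS SIGN, or any
non-ASCII digit, is rejected by parse_int_ascii_only; accepting such
input would be an explicit opt-in.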

------------------------------------------------------------------------
Here is a set of heuristic rules that may be helpful:

 -  Any character with the decimal digit property is unambiguous and
    should be accepted.
    Note: In some environments, applications may independently wish to
    restrict the decimal digit set to prevent security problems. See
    [UTR36].

 -  The exponent character can only be interpreted as such if it
    occurs after at least one digit, and if it is followed by at least
    one digit, with only an optional sign in between. A regular
    expression may be helpful here.

 -  For the sign, decimal separator, percent, and per mille, use a set
    of all possible characters that can serve those functions. For
    example, the decimal separator set could include all of
    [.,']. (The actual set of characters can be derived from the
    number symbols in the By-Type charts [ByType], which list all of
    the values in CLDR.) To disambiguate, the decimal separator for
    the locale must be removed from the "ignore" set, and the grouping
    separator for the locale must be removed from the decimal
    separator set. The same principle applies to all sets and symbols:
    any symbol must appear in at most one set.

 -  Since there are a wide variety of currency symbols and codes, this
    should be tried before the less ambiguous elements. It may be
    helpful to develop a set of characters that can appear in a symbol
    or code, based on the currency symbols in the locale.

 -  Otherwise, a character should be ignored unless it is in the
    "stop" set. This includes even characters that are meaningful for
    formatting, for example, the grouping separator.

 -  If more than one sign, currency symbol, exponent, or percent/per
    mille occurs in the input, the first found should be used.

 -  A currency symbol in the input should be interpreted as the
    longest match found in the set of possible currency symbols.

 -  Especially in cases of ambiguity, the user's input should be
    echoed back, properly formatted according to the locale, before it
    is actually used for anything.
------------------------------------------------------------------------
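
A few of the rules above (decimal digits, the exponent, the
sign/decimal-separator sets, ignoring other characters, and "first
sign wins") can be sketched roughly as follows.  This is only an
illustration under stated assumptions: the sign sets are picked by
hand rather than derived from CLDR, the names normalize and
lenient_parse are hypothetical, and currency symbols, percent/per
mille, and the "stop" set are left out entirely.

```python
import re
import unicodedata
from decimal import Decimal

# Assumed, hand-picked sets; a real implementation would derive them from CLDR.
MINUS_SIGNS = "-\u2212\ufe63\uff0d"  # hyphen-minus, MINUS SIGN, small and fullwidth forms
PLUS_SIGNS = "+\uff0b"               # plus sign and FULLWIDTH PLUS SIGN
DECIMAL_SEPARATORS = set(".,'")

# Exponent rule: "e"/"E" directly after a digit, followed by an optional
# sign and at least one digit.
EXPONENT = re.compile(r"(?<=[0-9])[eE]([+-]?[0-9]+)")

def normalize(text):
    """Map every decimal digit and every known sign to its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        if value is not None:
            out.append(str(value))
        elif ch in MINUS_SIGNS:
            out.append("-")
        elif ch in PLUS_SIGNS:
            out.append("+")
        else:
            out.append(ch)
    return "".join(out)

def lenient_parse(text, decimal_sep=".", grouping_sep=","):
    """Leniently parse text as a Decimal for a locale given by its two separators."""
    # Disambiguate: the locale's grouping separator leaves the decimal set.
    decimal_set = (DECIMAL_SEPARATORS | {decimal_sep}) - {grouping_sep}
    text = normalize(text)

    exponent = 0
    match = EXPONENT.search(text)
    if match:
        exponent = int(match.group(1))
        text = text[:match.start()]  # simplification: drop what follows the exponent

    sign = ""
    mantissa = []
    seen_decimal = False
    for ch in text:
        if ch in "0123456789":
            mantissa.append(ch)
        elif ch in "+-":
            if not sign:             # the first sign found is the one used
                sign = ch
        elif ch in decimal_set:
            if not seen_decimal:
                mantissa.append(".")
                seen_decimal = True
        # Anything else, the grouping separator included, is ignored.

    digits = "".join(mantissa)
    if not any(c.isdigit() for c in digits):
        raise ValueError("no digits found in %r" % (text,))
    return Decimal("%s%se%d" % (sign, digits, exponent))

print(lenient_parse("\u22121,234.5"))
print(lenient_parse("1.234,5", decimal_sep=",", grouping_sep="."))
```

Note how much such a parser accepts: "x2e" parses as 2, for example,
which is the kind of result that makes echoing the interpreted value
back to the user (the last rule) worthwhile.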