[Python-ideas] [Python-Dev] Unicode minus sign in numeric conversions

Sun Jun 9 07:30:52 CEST 2013

I'm beginning to feel that it was even a mistake to accept all those
other Unicode decimal digits, because it leads to the mistaken belief
that one can parse a number without knowing the locale. Python's
position about this is very different from what the heuristics you
quote seem to suggest: "refuse the temptation to guess" leads to our
current, simple rule which only accepts '.' as the decimal indicator,
and leaves localized parsing strictly to the application (or to some
other library module).
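
As a rough illustration of that rule (a sketch of CPython 3 behavior,
not a specification): int() and float() accept any character with the
Unicode decimal digit property, but only '.' is ever treated as the
decimal indicator. The variable name below is made up for the example.

```python
# Arabic-Indic digits U+0661..U+0663 have the decimal digit property,
# so int() accepts them without any locale information.
arabic_123 = "\u0661\u0662\u0663"
print(int(arabic_123))   # 123

# '.' is the only decimal indicator that float() recognizes.
print(float("1.5"))      # 1.5

# A comma is rejected outright rather than guessed at, whatever the
# current locale says about decimal separators.
try:
    float("1,5")
except ValueError as exc:
    print("rejected:", exc)
```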

Still, I suppose I could defend the current behavior from the
perspective of writing such a localized parser -- at some point you've
got a digit and you need to know its numeric value, and it is
convenient that int(c) does so. But interpreting the sign is different
-- once you know that a minus sign is present, there's already a
built-in operation to apply it (-x). So I'm not at all sure that the
behavior Łukasz observed should be considered a bug in the language.
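
A minimal sketch of that division of labor, assuming a hypothetical
helper parse_signed_int() and an assumed MINUS_SIGNS set (neither is
part of Python): int(c) supplies each digit's value, and the sign, once
recognized by the application, is applied with the ordinary unary
minus.

```python
# Assumed set of characters the application treats as a minus sign:
# ASCII hyphen-minus and U+2212 MINUS SIGN.
MINUS_SIGNS = {"-", "\u2212"}

def parse_signed_int(text):
    """Hypothetical helper: parse an optionally signed run of decimal digits."""
    negative = text[:1] in MINUS_SIGNS
    digits = text[1:] if negative else text
    if not digits or not all(c.isdecimal() for c in digits):
        raise ValueError(f"not a decimal integer: {text!r}")
    value = 0
    for c in digits:
        # int(c) gives the numeric value of any Unicode decimal digit.
        value = value * 10 + int(c)
    # The sign is applied by the built-in negation, not by int().
    return -value if negative else value
```

With this helper, parse_signed_int("\u221242") returns -42, whereas
int("\u221242") raises ValueError.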

--Guido

On Sat, Jun 8, 2013 at 6:04 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Guido van Rossum writes:
>
>  > Is there a Unicode standard for parsing numbers?
>
> Probably UTS #35 is most relevant.
>
> Unicode Technical Standard #35
> Unicode Locale Data Markup Language (LDML)
> Part 3: Numbers
> http://www.unicode.org/reports/tr35/tr35-numbers.html
> (Maintained separately from "the" Unicode standard.)
>
> Cf. http://www.unicode.org/versions/Unicode6.2.0/ch05.pdf, section
> 5.5, and http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf,
> section 4.4.  tl;dr version: *If* you recognize a character as a
> digit, it *must* be given the correct value.  Parsing numerical
> components from text is inherently hairy (e.g., Roman numerals, which
> exist in Unicode only as compatibility characters and therefore are
> not normatively digits, and variable names such as "x2"), and so is
> considered application-specific.
>
> UTS #35 recommends the "lenient" rules appended below for parsing
> numerical data.  N.B. Some terms such as "'ignore' set" are defined
> elsewhere in the TR.  Apparently lenient parsing is expected to cause
> no problems in "honest" environments (the only exception mentioned is
> security, e.g., confusables).
>
> The parse is explicitly locale-dependent in UTS #35, so it could depend
> on a subset of all possible numeric expression characters.  Python
> could take the position that the only numeric formats known to Python
> are those based on ASCII (but even then there are ambiguities by
> locale -- cf. the 4th rule below).  I think this is probably best by
> default.  Parsing numbers expressed in mixed scripts is clearly a
> security risk, and doesn't seem to serve a useful purpose in numerical
> data.  To do a good job of parsing numerical information from general
> text, a separate library which is sensitive to the context in which
> the numbers are embedded is required.
>
>
> ------------------------------------------------------------------------
> Here is a set of heuristic rules that may be helpful:
>
>  -  Any character with the decimal digit property is unambiguous and
>     should be accepted.
>     Note: In some environments, applications may independently wish to
>     restrict the decimal digit set to prevent security problems. See
>     [UTR36].
>
>  -  The exponent character can only be interpreted as such if it
>     occurs after at least one digit, and if it is followed by at least
>     one digit, with only an optional sign in between. A regular
>     expression may be helpful here.
>
>  -  For the sign, decimal separator, percent, and per mille, use a set
>     of all possible characters that can serve those functions. For
>     example, the decimal separator set could include all of
>     [.,']. (The actual set of characters can be derived from the
>     number symbols in the By-Type charts [ByType], which list all of
>     the values in CLDR.) To disambiguate, the decimal separator for
>     the locale must be removed from the "ignore" set, and the grouping
>     separator for the locale must be removed from the decimal
>     separator set. The same principle applies to all sets and symbols:
>     any symbol must appear in at most one set.
>
>  -  Since there are a wide variety of currency symbols and codes, this
>     should be tried before the less ambiguous elements. It may be
>     helpful to develop a set of characters that can appear in a symbol
>     or code, based on the currency symbols in the locale.
>
>  -  Otherwise, a character should be ignored unless it is in the
>     "stop" set. This includes even characters that are meaningful for
>     formatting, for example, the grouping separator.
>
>  -  If more than one sign, currency symbol, exponent, or percent/per
>     mille occurs in the input, the first found should be used.
>
>  -  A currency symbol in the input should be interpreted as the
>     longest match found in the set of possible currency symbols.
>
>  -  Especially in cases of ambiguity, the user's input should be
>     echoed back, properly formatted according to the locale, before it
>     is actually used for anything.
> ------------------------------------------------------------------------
>
>
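
The exponent rule in the quoted heuristics suggests a regular
expression. One possible sketch follows, assuming 'e'/'E' as the
exponent character and ASCII '+', '-' plus U+2212 as the allowed signs
(the quoted text does not fix either set); has_exponent() is a
hypothetical name.

```python
import re

# An exponent character counts only if a digit comes immediately before
# it and a digit follows it, with at most one sign in between.
EXPONENT = re.compile(r"(?<=\d)[eE][+\-\u2212]?(?=\d)")

def has_exponent(text):
    """Return True if text contains a character usable as an exponent marker."""
    return EXPONENT.search(text) is not None

for sample in ("1e5", "1e-5", "e5", "1e", "1e+-5", "x2"):
    print(sample, has_exponent(sample))
```

Note that \d in a Python 3 str pattern matches any Unicode decimal
digit, which is in line with the first heuristic but may be broader
than an application wants.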

-- 
--Guido van Rossum (python.org/~guido)