[Python-ideas] [Python-Dev] Unicode minus sign in numeric conversions
Stephen J. Turnbull
stephen at xemacs.org
Sun Jun 9 03:04:54 CEST 2013
Guido van Rossum writes:
> Is there a Unicode standard for parsing numbers?
Probably UTS #35 is the most relevant.
Unicode Technical Standard #35
Unicode Locale Data Markup Language (LDML)
Part 3: Numbers
http://www.unicode.org/reports/tr35/tr35-numbers.html
(Maintained separately from "the" Unicode standard.)
Cf. http://www.unicode.org/versions/Unicode6.2.0/ch05.pdf, section
5.5, and http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf,
section 4.4. tl;dr version: *If* you recognize a character as a
digit, it *must* be given the correct value. Parsing numerical
components from text is inherently hairy (e.g., Roman numerals, which
exist in Unicode only as compatibility characters and therefore are
not normatively digits, and variable names such as "x2"), and so it
is considered application-specific.
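
To make the digit/non-digit distinction concrete, here is a minimal
Python sketch. It only uses the built-in int() and the standard
unicodedata module; try_int is a hypothetical helper introduced for
illustration, not an existing API.

```python
import unicodedata


def try_int(text):
    """Return int(text), or None if int() rejects the string."""
    try:
        return int(text)
    except ValueError:
        return None


# Characters with the decimal digit property are accepted by int()
# and are given their correct values.
arabic_indic = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGITS ONE, TWO, THREE
assert try_int(arabic_indic) == 123

# A Roman numeral carries a numeric value but is not a decimal digit,
# so int() does not treat it as one.
roman_four = "\u2163"  # ROMAN NUMERAL FOUR
assert unicodedata.numeric(roman_four) == 4.0
assert not roman_four.isdecimal()
assert try_int(roman_four) is None

# U+2212 MINUS SIGN is not accepted as a sign by int().
assert try_int("\u22125") is None
assert try_int("-5") == -5
```
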
UTS #35 recommends the "lenient" rules appended below for parsing
numerical data. N.B. Some terms such as "'ignore' set" are defined
elsewhere in the TR. Apparently lenient parsing is expected to cause
no problems in "honest" environments (the only exception mentioned is
security, e.g., confusables).
The parse is explicitly locale-dependent in UTS #35, so it could
depend on only a subset of all possible numeric expression characters.
Python could take the position that the only numeric formats known to
Python are those based on ASCII (but even then there are ambiguities
by locale -- cf. the 4th rule below). I think this is probably best by
default. Parsing numbers expressed in mixed scripts is clearly a
security risk, and doesn't seem to serve a useful purpose in numerical
data. To do a good job of parsing numerical information from general
text, a separate library that is sensitive to the context in which
the numbers are embedded is required.
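
As a rough illustration of the mixed-script concern, the sketch below
flags strings whose decimal digits come from more than one digit set.
It is only a crude proxy for a real script check: it assumes that each
set of decimal digits is encoded as a contiguous run 0..9, so that
ord(ch) minus the digit's value identifies the set. The function names
digit_zeros and uses_single_digit_set are hypothetical.

```python
import unicodedata


def digit_zeros(text):
    """Return the code points of the 'zero' of every digit set used in text.

    Assumes each set of decimal digits occupies ten consecutive code
    points in ascending order, so ord(ch) - value is the set's zero.
    """
    zeros = set()
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        if value is not None:
            zeros.add(ord(ch) - value)
    return zeros


def uses_single_digit_set(text):
    """True if all decimal digits in text belong to one digit set."""
    return len(digit_zeros(text)) <= 1


assert uses_single_digit_set("2013")
assert uses_single_digit_set("\u0661\u0662\u0663")
# ASCII ONE followed by ARABIC-INDIC DIGIT TWO mixes two digit sets.
assert not uses_single_digit_set("1\u0662")
```

A caller wanting the ASCII-only default could instead require
digit_zeros(text) <= {0x30}.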
------------------------------------------------------------------------
Here is a set of heuristic rules that may be helpful:
- Any character with the decimal digit property is unambiguous and
should be accepted.
Note: In some environments, applications may independently wish to
restrict the decimal digit set to prevent security problems. See
[UTR36].
- The exponent character can only be interpreted as such if it
occurs after at least one digit, and if it is followed by at least
one digit, with only an optional sign in between. A regular
expression may be helpful here.
- For the sign, decimal separator, percent, and per mille, use a set
of all possible characters that can serve those functions. For
example, the decimal separator set could include all of
[.,']. (The actual set of characters can be derived from the
number symbols in the By-Type charts [ByType], which list all of
the values in CLDR.) To disambiguate, the decimal separator for
the locale must be removed from the "ignore" set, and the grouping
separator for the locale must be removed from the decimal
separator set. The same principle applies to all sets and symbols:
any symbol must appear in at most one set.
- Since there are a wide variety of currency symbols and codes, this
should be tried before the less ambiguous elements. It may be
helpful to develop a set of characters that can appear in a symbol
or code, based on the currency symbols in the locale.
- Otherwise, a character should be ignored unless it is in the
"stop" set. This includes even characters that are meaningful for
formatting, for example, the grouping separator.
- If more than one sign, currency symbol, exponent, or percent/per
mille occurs in the input, the first found should be used.
- A currency symbol in the input should be interpreted as the
longest match found in the set of possible currency symbols.
- Especially in cases of ambiguity, the user's input should be
echoed back, properly formatted according to the locale, before it
is actually used for anything.
------------------------------------------------------------------------
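A partial Python sketch of the digit, sign, and exponent rules quoted
above follows. It is an illustration under stated assumptions, not an
implementation of UTS #35: the sign set is a small hand-picked sample
rather than one derived from CLDR, the decimal separator is fixed to
"." instead of being locale-dependent, and unrecognized characters are
rejected rather than ignored. parse_number and the constants are
hypothetical names.

```python
import re
import unicodedata

# Assumed sample sign sets; a real implementation would derive them
# from CLDR data for the active locale.
MINUS_SIGNS = "-\u2212"  # HYPHEN-MINUS, MINUS SIGN
SIGN = r"[-+\u2212\uff0b]"  # the above plus PLUS SIGN, FULLWIDTH PLUS SIGN

# In str patterns, \d matches any character with the decimal digit
# property. The exponent group requires a digit before the marker and
# at least one digit after it, with only an optional sign in between.
NUMBER_RE = re.compile("".join([
    r"(?P<sign>", SIGN, r")?",
    r"(?P<int>\d+)",
    r"(?:\.(?P<frac>\d+))?",
    r"(?:[eE](?P<esign>", SIGN, r")?(?P<exp>\d+))?",
]))


def to_ascii_digits(digits):
    """Map each decimal digit character to its ASCII equivalent."""
    return "".join(str(unicodedata.decimal(ch)) for ch in digits)


def ascii_sign(sign):
    """Return '-' for any accepted minus sign, '' otherwise."""
    return "-" if sign and sign in MINUS_SIGNS else ""


def parse_number(text):
    """Parse text as a float, accepting Unicode digits and signs."""
    m = NUMBER_RE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not a number: {text!r}")
    literal = ascii_sign(m["sign"]) + to_ascii_digits(m["int"])
    if m["frac"]:
        literal += "." + to_ascii_digits(m["frac"])
    if m["exp"]:
        literal += "e" + ascii_sign(m["esign"]) + to_ascii_digits(m["exp"])
    return float(literal)


assert parse_number("\u221212") == -12.0
assert parse_number("1e\u22122") == 0.01
```

Because the whole string must match, an input such as "1e" (exponent
marker with no following digit) or "x2" raises ValueError instead of
being silently reinterpreted.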