[Python-Dev] Python and the Unicode Character Database

Alexander Belopolsky alexander.belopolsky at gmail.com
Thu Dec 2 01:44:28 CET 2010


On Wed, Dec 1, 2010 at 7:17 PM, Steven D'Aprano <steve at pearwood.info> wrote:
..
> we should continue to support the existing behaviour. None of the arguments
> against it seem convincing to me, particularly since the opponents of the
> current behaviour admit that there is a use-case for it, but they just want
> it to move elsewhere, such as the locale module.
>

I don't remember who made this argument, but I think you misunderstood
it.  The argument was that if there was a use case for parsing Eastern
Arabic numerals, it would be better served by a module written by
someone who speaks one of the Arabic languages and knows the details
of how Eastern Arabic numerals are written.  So far nobody has even
claimed to know conclusively that Arabic-Indic digits are always
written left-to-right.

>>> import unicodedata
>>> unicodedata.bidirectional('٤')
'AN'

is not very helpful because, according to unicode.org, 'AN' means only
"any Arabic-Indic digit".  (To me, a special category hints that the
digits may be written in either direction and that the proper
interpretation may depend on context.)  I have not seen a real use case
reported in this thread, and for the theoretical use cases the current
implementation is either outright wrong or does not solve the problem
completely.  Given
that a function that replaces all Unicode digits in a string with 0-9
can be written in 3 lines of Python code, it is very unlikely that
anyone would prefer to rely on undocumented behavior of Python
builtins instead of having explicit control over parsing of their
data.
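
For illustration, here is a minimal sketch of such a function.  The
name to_ascii_digits is made up for this example, and I am assuming
that "Unicode digit" means any character for which str.isdecimal() is
true (general category Nd).  It does nothing about sign characters,
separators or digit ordering, which is exactly the kind of detail a
dedicated module would have to settle.

```python
import unicodedata

def to_ascii_digits(text):
    # Hypothetical helper: map every decimal digit character to 0-9
    # and leave all other characters untouched.
    return ''.join(str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
                   for ch in text)
```

With this, the caller decides explicitly when to normalize, for
example int(to_ascii_digits(s)), instead of depending on what the
builtin happens to accept.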

