[Python-Dev] Python and the Unicode Character Database

Alexander Belopolsky alexander.belopolsky at gmail.com
Thu Dec 2 19:14:29 CET 2010


On Thu, Dec 2, 2010 at 11:56 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Thursday, 2 December 2010 at 11:41 -0500, Alexander Belopolsky wrote:
>>
>> Note that my point is not to find the correct answer here, but to
>> demonstrate that we as a group don't have the expertise to get parsing
>> of Arabic text right.
>
> I don't understand why you think Arabic or Hebrew text is any different
> from Western text. Surely right-to-left isn't more conceptually
> complicated than left-to-right, is it?
>

No, but a mix of LTR and RTL is certainly more difficult than either
of the two.  I invite you to digest Unicode Standard Annex #9 before
we continue this discussion.

See <http://unicode.org/reports/tr9/>.


> The fact that mixed rtl + ltr can render bizarrely or is awkward to cut
> and paste is quite off-topic for our discussion.
>

No, it is not.  One of the invented use cases in this thread was naive
users' desire to enter numbers using their preferred local decimals.
The same users may want to be able to cut and paste their decimals as
well.  More importantly, however, legacy formats may not have support
for mixed-direction text and may require that "John is 41" be stored
as "41 si nhoJ".  A Unicode converter would turn it into "[RTL]John is
14", which will still display as "41 si nhoJ", but int(s[-2:]) will
return 14, not 41.
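
A minimal sketch of that slicing hazard.  It assumes that "[RTL]" stands
for U+202E (RIGHT-TO-LEFT OVERRIDE) and that the converter simply reverses
the visually ordered legacy string; both are illustrative assumptions, not
a description of any particular converter.

```python
# Assumption: "[RTL]" in the text above is U+202E RIGHT-TO-LEFT OVERRIDE.
RLO = "\u202e"

# What the legacy format stores (visual order) and what the user sees.
visual = "41 si nhoJ"

# Hypothetical converter: prepend the override and reverse into logical order.
logical = RLO + visual[::-1]

# Slicing works on code point order, not on what is displayed.
value = int(logical[-2:])
print(repr(logical))  # '\u202eJohn is 14'
print(value)          # 14, although the user sees "41"
```

The point is only that str indexing follows the stored order, so any
parsing that relies on "the last two characters" depends on how the text
was converted.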

>> If we've got it right for Arabic, it is by
>> chance and not by design.  This still leaves us with 41 other types of
>> digits for at least 30 different languages.
>
> So why do you trust the Unicode standard on other things and not on this
> one?

What other things?  As far as I understand, the only str method that was
designed to comply with Unicode recommendations was str.isidentifier().
And we have some really bizarre results:


>>> '\u2164'.isidentifier()
True
>>> '\u2164'.isalpha()
False
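
For what it is worth, the unicodedata module can be used to look at why
the two methods disagree.  This is only a sketch of how to inspect the
character; the explanation in the comments reflects the documented
behavior of str.isalpha() and the identifier rules as I understand them.

```python
import unicodedata

ch = "\u2164"

name = unicodedata.name(ch)          # 'ROMAN NUMERAL FIVE'
category = unicodedata.category(ch)  # 'Nl' (Number, letter)

# str.isalpha() is documented to accept only the letter categories
# Lm, Lt, Lu, Ll and Lo, so an Nl character is rejected.
is_alpha = ch.isalpha()

# Identifier start characters include category Nl, so the same
# character is accepted by str.isidentifier().
is_identifier = ch.isidentifier()

print(name, category, is_alpha, is_identifier)
```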

And can you describe the difference between str.isdigit() and
str.isdecimal()?  According to the reference manual,

"""
str.isdecimal()
Return true if all characters in the string are decimal characters and
there is at least one character, false otherwise. Decimal characters
include digit characters, and all characters that that can be used to
form decimal-radix numbers, e.g. U+0660, ARABIC-INDIC DIGIT ZERO.

str.isdigit()
Return true if all characters in the string are digits and there is at
least one character, false otherwise.
""" http://docs.python.org/dev/library/stdtypes.html#str.isdecimal

Since U+0660 is mentioned in the first definition and not in the
second, I may conclude that it is not a digit, but

>>> '\u0660'.isdigit()
True

If you know the correct answer, please contribute it here:
<http://bugs.python.org/issue10587>.
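
One way to probe the distinction empirically is to compare the str
methods with the corresponding unicodedata lookups.  This is a sketch,
not an authoritative definition; the helper name classify() and the
choice of sample characters are mine.

```python
import unicodedata


def classify(ch):
    """Collect the numeric properties and str predicates for one character."""
    return {
        # The second argument is the default returned when the
        # character lacks the property.
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
        "isdecimal": ch.isdecimal(),
        "isdigit": ch.isdigit(),
        "isnumeric": ch.isnumeric(),
    }


samples = {
    "ARABIC-INDIC DIGIT ZERO": "\u0660",
    "SUPERSCRIPT TWO": "\u00b2",
    "ROMAN NUMERAL FIVE": "\u2164",
}

results = {label: classify(ch) for label, ch in samples.items()}

for label, props in results.items():
    print(label, props)
```

With these samples, U+0660 passes all three predicates, U+00B2 passes
isdigit() and isnumeric() but not isdecimal(), and U+2164 passes only
isnumeric().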

