[Python-Dev] Python and the Unicode Character Database

M.-A. Lemburg mal at egenix.com
Fri Dec 3 11:15:51 CET 2010


Alexander Belopolsky wrote:
> On Thu, Dec 2, 2010 at 5:58 PM, M.-A. Lemburg <mal at egenix.com> wrote:
> ..
>>> I will change my mind on this issue when you present a
>>> machine-readable file with Arabic-Indic numerals and a program capable
>>> of reading it and show that this program uses the same number parsing
>>> algorithm as Python's int() or float().
>>
>> Have you had a look at the examples I posted ? They include texts
>> and tables with numbers written using east asian arabic numerals.
> 
> Yes, but this was all about output.  I am pretty sure TeX was able to
> typeset Qur'an in all its glory long before Unicode was invented.
> Yet, in machine readable form it would be something like {\quran 1}
> (invented directive).   I have asked for a file that is intended for
> machine processing, not for human enjoyment in print or on a display.
>  I claim that if such file exists, the program that reads it does not
> use the same rules as Python and converting non-ascii digits would be
> a tiny portion of what that program does.

Well, programs that take input from the keyboards I posted in this
thread will have to deal with the digits. Since Python's input()
accepts keyboard input, you have your use case :-)
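As a minimal sketch of that use case (written for Python 3, not the 2.7 used elsewhere in this mail; the typed string and the helper name describe are made up for illustration), this is roughly what a program sees when a user types Arabic-Indic digits:

```python
import unicodedata

# Hypothetical keyboard input: Arabic-Indic digits ONE, TWO, THREE
typed = "\u0661\u0662\u0663"

def describe(text):
    """Return (character name, decimal value) pairs for each character."""
    return [(unicodedata.name(ch), unicodedata.decimal(ch)) for ch in text]

# int() and float() accept any characters with the Unicode decimal digit property
value = int(typed)
ratio = float("\u0661\u0662.\u0665")  # Arabic-Indic digits with an ASCII decimal point

print(describe(typed))
print(value, ratio)
```

The unicodedata lookup is only there to show which characters are involved; the conversion itself is plain int() and float().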

Seriously, I find the distinction between input and output forms
of numerals somewhat misguided. Any output can also serve as input.
For books and other printed material, images, etc. you have scanners
and OCR. For screen output you have screen readers. For spreadsheets
and data, you have CSV, TSV, XML, etc. etc. etc.

Just for the fun of it, I created a CSV file with Thai and Dzongkha
numerals (in addition to Arabic ones) using OpenOffice. Here's the
cut and paste version:

"""
Numbers in various scripts		
		
Arabic	Thai	Dzongkha
1	๑	༡
2	๒	༢
3	๓	༣
4	๔	༤
5	๕	༥
6	๖	༦
7	๗	༧
8	๘	༨
9	๙	༩
10	๑๐	༡༠
11	๑๑	༡༡
12	๑๒	༡༢
13	๑๓	༡༣
14	๑๔	༡༤
15	๑๕	༡༥
16	๑๖	༡༦
17	๑๗	༡༧
18	๑๘	༡༨
19	๑๙	༡༩
20	๒๐	༢༠
"""

And here's the script that goes with it:

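import csv
# Python 2.7: open in binary mode, as the csv module expects
c = csv.reader(open('Numbers-in-various-scripts.csv', 'rb'))
# Skip the three header rows (title, blank line, column names)
headers = [c.next() for i in range(3)]
# Iterate over the reader; a "while c:" loop would end with StopIteration
for row in c:
    print [int(unicode(x, 'utf-8')) for x in row]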

and the output using Python 2.7:

[1, 1, 1]
[2, 2, 2]
[3, 3, 3]
[4, 4, 4]
[5, 5, 5]
[6, 6, 6]
[7, 7, 7]
[8, 8, 8]
[9, 9, 9]
[10, 10, 10]
[11, 11, 11]
[12, 12, 12]
[13, 13, 13]
[14, 14, 14]
[15, 15, 15]
[16, 16, 16]
[17, 17, 17]
[18, 18, 18]
[19, 19, 19]
[20, 20, 20]
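For readers on Python 3, a rough equivalent might look like the sketch below. It is not the script that produced the output above: the inline SAMPLE string stands in for the attached file (assumed to be comma-separated with three header rows), and parse_rows is a hypothetical helper name.

```python
import csv
import io

# Inline stand-in for the CSV file; escapes are used for the Thai (U+0E5x)
# and Dzongkha/Tibetan (U+0F2x) digits to avoid encoding surprises.
SAMPLE = (
    "Numbers in various scripts,,\n"
    ",,\n"
    "Arabic,Thai,Dzongkha\n"
    "1,\u0e51,\u0f21\n"
    "10,\u0e51\u0e50,\u0f21\u0f20\n"
    "20,\u0e52\u0e50,\u0f22\u0f20\n"
)

def parse_rows(text, header_rows=3):
    """Parse CSV text and convert every data cell with int()."""
    reader = csv.reader(io.StringIO(text))
    for _ in range(header_rows):
        next(reader)  # skip title, blank row and column names
    return [[int(cell) for cell in row] for row in reader]

rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

In Python 3 the csv module yields str objects, so the explicit unicode(x, 'utf-8') decoding step is no longer needed.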

If you need more such files, I can generate as many as you like ;-)
I can send the OOo file as well, if you'd like to play around with it.

I'd say: case closed :-)

-- 
Marc-Andre Lemburg
eGenix.com

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Numbers-in-various-scripts.csv
URL: <http://mail.python.org/pipermail/python-dev/attachments/20101203/0f4a8bee/attachment.ksh>

