trying to understand unicode

F. Petitjean littlejohn.75 at news.free.fr
Wed Apr 20 06:58:35 EDT 2005


Python has very good support for Unicode, UTF-8, encodings ... But I
have some difficulty with the concepts and the vocabulary. The
documentation is not bad, but, for example, when reading
http://docs.python.org/lib/module-unicodedata.html
it took me a long time to figure out what unicodedata.digit(unichr) would
mean; a simple example is badly lacking.
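
To illustrate what the three numeric lookups in unicodedata return, here is a minimal sketch. It assumes Python 3 (where every str is Unicode), not the Python 2 used in the script further down, and the helper name describe is only made up for this example. Each lookup is given None as the default, so a character without that property yields None instead of raising ValueError.

```python
import unicodedata

# Characters that are "numeric" in different senses, plus one that is not.
samples = [
    "7",        # DIGIT SEVEN
    "\u00b2",   # SUPERSCRIPT TWO
    "\u00bd",   # VULGAR FRACTION ONE HALF
    "A",        # LATIN CAPITAL LETTER A
]


def describe(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),  # only true decimal digits
        unicodedata.digit(ch, None),    # also digits such as superscripts
        unicodedata.numeric(ch, None),  # any numeric value, as a float
    )


for ch in samples:
    print(unicodedata.name(ch).ljust(28), describe(ch))
```

SUPERSCRIPT TWO has a digit value but no decimal value, and VULGAR FRACTION ONE HALF has only a numeric value, which seems to be the distinction the documentation is trying to make.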

So I wrote the following script:

#!/usr/bin/env python

"""Example of use of the unicodedata module
http://docs.python.org/lib/module-unicodedata.html
"""

import unicodedata
import sys

# outcodec = 'latin_1'
outcodec = 'iso8859_15'
if len(sys.argv) > 1:
    outcodec = sys.argv[1]

for c in range(256):
    uc = unichr(c)
    # Characters without a name (the control characters) are skipped.
    uname = unicodedata.name(uc, None)
    if uname:
        # Decomposed (NFD) and composed (NFC) forms, encoded for output.
        unfd = unicodedata.normalize('NFD', uc).encode(outcodec, 'replace')
        unfc = unicodedata.normalize('NFC', uc).encode(outcodec, 'replace')
        print str(c).ljust(3), uname.ljust(42), unfd.ljust(2), unfc.ljust(2), \
                unicodedata.category(uc), unicodedata.numeric(uc, None)


and here are some samples of the output:
44  COMMA                                      ,  ,  Po None
45  HYPHEN-MINUS                               -  -  Pd None
46  FULL STOP                                  .  .  Po None
47  SOLIDUS                                    /  /  Po None
48  DIGIT ZERO                                 0  0  Nd 0.0
49  DIGIT ONE                                  1  1  Nd 1.0
50  DIGIT TWO                                  2  2  Nd 2.0

It seems that the 'Nd' category means 'Number, decimal digit'. Doh!

64  COMMERCIAL AT                              @  @  Po None
65  LATIN CAPITAL LETTER A                     A  A  Lu None
66  LATIN CAPITAL LETTER B                     B  B  Lu None

'Lu' should read 'Letter, uppercase'?

94  CIRCUMFLEX ACCENT                          ^  ^  Sk None
95  LOW LINE                                   _  _  Pc None
96  GRAVE ACCENT                               `  `  Sk None
97  LATIN SMALL LETTER A                       a  a  Ll None
98  LATIN SMALL LETTER B                       b  b  Ll None
'Ll' == 'Letter, lowercase'
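
The two-letter codes are the Unicode general category abbreviations: the first letter is the major class (Letter, Number, Punctuation, Symbol, Separator) and the second a subclass. As far as I know unicodedata has no function that returns the long names, so the sketch below uses a small hand-written table covering only the codes seen in the sample output. It assumes Python 3, and CATEGORY_NAMES and category_name are names invented for this example.

```python
import unicodedata

# Hand-written subset of the general category abbreviations; this table is
# an assumption of the example, not something provided by unicodedata.
CATEGORY_NAMES = {
    "Lu": "Letter, uppercase",
    "Ll": "Letter, lowercase",
    "Nd": "Number, decimal digit",
    "Pc": "Punctuation, connector",
    "Pd": "Punctuation, dash",
    "Pe": "Punctuation, close",
    "Po": "Punctuation, other",
    "Sk": "Symbol, modifier",
    "Sm": "Symbol, math",
    "Zs": "Separator, space",
}


def category_name(ch):
    """Return (code, long name) for the general category of ch."""
    code = unicodedata.category(ch)
    return code, CATEGORY_NAMES.get(code, "(not in this table)")


for ch in "0Aa-_\u00f7\u00a0":
    print(repr(ch).ljust(8), *category_name(ch))
```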

124 VERTICAL LINE                              |  |  Sm None
125 RIGHT CURLY BRACKET                        }  }  Pe None
126 TILDE                                      ~  ~  Sm None
160 NO-BREAK SPACE                                   Zs None
161 INVERTED EXCLAMATION MARK                  ¡  ¡  Po None

What a gap!
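
The gap between 126 and 160 comes from the script itself: it skips every character for which unicodedata.name() has no name, and the control characters have none. A small sketch to check this, assuming Python 3 (chr instead of unichr):

```python
import unicodedata

# Code points below 256 for which unicodedata knows no name.
unnamed = [c for c in range(256) if unicodedata.name(chr(c), None) is None]

# General categories of those code points.
categories = {unicodedata.category(chr(c)) for c in unnamed}

print(unnamed)
print(categories)
```

The unnamed code points are 0-31 and 127-159, and all of them are in category 'Cc' (Other, control).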

245 LATIN SMALL LETTER O WITH TILDE            o? õ  Ll None
246 LATIN SMALL LETTER O WITH DIAERESIS        o? ö  Ll None
247 DIVISION SIGN                              ÷  ÷  Sm None
248 LATIN SMALL LETTER O WITH STROKE           ø  ø  Ll None

'Sm' should read 'Symbol, math'?
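
The 'o?' in the NFD column for 245 and 246 is not a bug. NFD splits the letter into a base letter plus a combining mark, and the combining mark does not exist in iso8859_15, so the 'replace' error handler turns it into '?'. A minimal sketch of this for code point 245, assuming Python 3:

```python
import unicodedata

ch = "\u00f5"  # LATIN SMALL LETTER O WITH TILDE (245)

nfd = unicodedata.normalize("NFD", ch)  # decomposed: base letter + mark
nfc = unicodedata.normalize("NFC", ch)  # composed: a single code point

# Names of the code points that make up the decomposed form.
print([unicodedata.name(c) for c in nfd])

# The combining mark cannot be encoded, so 'replace' substitutes '?'.
print(nfd.encode("iso8859_15", "replace"))
print(nfc.encode("iso8859_15", "replace"))
```

The decomposed form is 'o' followed by COMBINING TILDE (U+0303), which explains the two characters in that column.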

I think that such code snippets should be included in the documentation
or in a Wiki.

Regards

Sorry for my bad English; I'm not a native speaker.


