unicode data - accessing codepoints > FFFF on narrow python builds

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Wed Apr 18 15:32:51 EDT 2007


On Wed, 18 Apr 2007 06:37:56 -0300, <vbr at email.cz> wrote:

> Hi all,
> I'd like to ask about the usage of unicode data on a narrow python build.
> Unicode string literals with \N{name} work even without an (explicit) import
> of unicodedata, and they also correctly handle the "wider" Unicode
> planes - above FFFF
>
>>>>  u"\N{LATIN SMALL LETTER E}"
> u'e'
>>>>  u"\N{GOTHIC LETTER AHSA}"
> u'\U00010330'
>
> The unicodedata functions work analogously in the basic plane, but
> behave differently otherwise:
>
>>>>  unicodedata.lookup("LATIN SMALL LETTER E")
> u'e'
>>>> unicodedata.lookup("GOTHIC LETTER AHSA")
> u'\u0330'
>
> (0001 gets trimmed)
>
> Is it a bug in unicodedata, or is this the expected behaviour on a  
> narrow build?

Looks like a bug, but I'm not sure whether in unicodedata or in general  
Unicode support:

py> x=u"\N{GOTHIC LETTER AHSA}"
py> ord(x)
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
py> unicodedata.name(x)
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
TypeError: need a single Unicode character as parameter
py> len(x)
2
py> list(x)
[u'\ud800', u'\udf30']

That looks like UTF-16 (a surrogate pair?) but seen as two characters
instead of one.
Probably on a narrow build Python should refuse to use such characters (and
limit Unicode support to the basic plane?) (or not?) (if not, what's the
point of sys.maxunicode?) (enough parentheses for now).
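
For what it's worth, the two values shown by list(x) are what the standard
UTF-16 surrogate arithmetic gives for U+10330, and the u'\u0330' returned by
unicodedata.lookup matches the code point cut down to 16 bits. Below is a
rough sketch of that arithmetic. The helper names to_surrogates and
from_surrogates are made up for illustration (they are not part of
unicodedata or any other module), and they work on plain integers so the
result does not depend on how the interpreter was built:

```python
import sys

def to_surrogates(cp):
    """Split a code point above 0xFFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary planes")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    """Combine a (high, low) UTF-16 surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF on a wide one.
is_narrow = sys.maxunicode == 0xFFFF

print(hex(from_surrogates(0xD800, 0xDF30)))      # 0x10330
print([hex(u) for u in to_surrogates(0x10330)])  # ['0xd800', '0xdf30']
print(hex(0x10330 & 0xFFFF))                     # 0x330, the truncated value
```

On a narrow build one could feed ord(x[0]) and ord(x[1]) to from_surrogates
to recover the real code point, though that is only a workaround and not a
fix for the lookup result.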

Anyway a better place for bug reports is  
http://sourceforge.net/tracker/?group_id=5470

-- 
Gabriel Genellina



