unicode data - accessing codepoints > FFFF on narrow python builds
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Wed Apr 18 15:32:51 EDT 2007
On Wed, 18 Apr 2007 06:37:56 -0300, <vbr at email.cz> wrote:
> Hi all,
> I'd like to ask about using Unicode data on a narrow Python build.
> Unicode string literals with \N{name} work even without an explicit import
> of unicodedata, and they also correctly handle the higher Unicode
> planes, above FFFF:
>
> >>> u"\N{LATIN SMALL LETTER E}"
> u'e'
> >>> u"\N{GOTHIC LETTER AHSA}"
> u'\U00010330'
>
> The unicodedata functions work analogously in the basic plane, but
> behave differently otherwise:
>
> >>> unicodedata.lookup("LATIN SMALL LETTER E")
> u'e'
> >>> unicodedata.lookup("GOTHIC LETTER AHSA")
> u'\u0330'
>
> (0001 gets trimmed)
>
> Is it a bug in unicodedata, or is this the expected behaviour on a
> narrow build?
Looks like a bug, but I'm not sure whether in unicodedata or in general
Unicode support:
py> x=u"\N{GOTHIC LETTER AHSA}"
py> ord(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
py> unicodedata.name(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: need a single Unicode character as parameter
py> len(x)
2
py> list(x)
[u'\ud800', u'\udf30']
That looks like a UTF-16 surrogate pair (?) but seen as two characters instead of one.
Probably on a narrow (16-bit) build Python should refuse to use such a character (and
limit Unicode support to the basic plane?) (or not?) (if not, what's the
point of sys.maxunicode?) (enough parentheses for now).
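For what it's worth, the arithmetic behind those two code units can be sketched
as follows. This is only an illustration of the standard UTF-16 surrogate
calculation, not anything taken from the unicodedata source; the helper names
to_surrogates and from_surrogates are made up for this example.

```python
def to_surrogates(cp):
    """Split a code point above 0xFFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary planes")
    offset = cp - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Combine a UTF-16 surrogate pair back into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10330)          # GOTHIC LETTER AHSA
print("%s %s" % (hex(high), hex(low)))      # 0xd800 0xdf30
print(hex(from_surrogates(high, low)))      # 0x10330
```

On a narrow build one could presumably apply from_surrogates to
ord(x[0]) and ord(x[1]) to recover the real code point from a two-unit string
like the one above.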
Anyway, a better place for bug reports is
http://sourceforge.net/tracker/?group_id=5470
--
Gabriel Genellina