python-unicode doesn't support >65535 symbols?

gabor gabor at z10n.net
Thu Nov 27 06:25:07 EST 2003


hi,

today i made some tests...

i tested some unicode symbols that are above the 16bit limit
(gothic: http://www.unicode.org/charts/PDF/U10330.pdf).

i played around with iconv and so on,
so at the end i created a utf8 encoded text file,
with the text "Marrakesh",
where the second 'a' was replaced with
GOTHIC_LETTER_AHSA (unicode-value:0x10330).

(i simply wrote the text file "Marrakesh", used iconv to convert it to
utf32big-endian, and replaced the character in hexedit, then converted
with iconv back to utf8).
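
for reference, the same bytes can probably be produced without iconv and hexedit. this is only a sketch and it assumes a python 3 interpreter (the session below is python 2); the variable names are made up for illustration:

```python
# Python 3 sketch: build the test string directly instead of using iconv + hexedit.
# U+10330 (GOTHIC LETTER AHSA) replaces the second 'a' of "Marrakesh".
text = "Marr\U00010330kesh"

# UTF-8 needs four bytes for a code point above U+FFFF.
utf8_bytes = text.encode("utf-8")

# UTF-32 (big-endian, no BOM) uses exactly four bytes per code point,
# so dividing by 4 gives the number of code points.
utf32_bytes = text.encode("utf-32-be")
code_point_count = len(utf32_bytes) // 4

print(utf8_bytes)
print(code_point_count)
```

the utf8 bytes should match what open("utf8.txt").read() returns below, and the utf32 count is the length i would expect.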

now i started python:

>>> data = open("utf8.txt").read()
>>> data
'Marr\xf0\x90\x8c\xb0kesh'
>>> text = data.decode("utf8")
>>> text
u'Marr\U00010330kesh'

so far it seemed ok.
then i did:

>>> len(text)
10

this is wrong. the length should be 9.
and why?

>>> text[0]
u'M'
>>> text[1]
u'a'
>>> text[2]
u'r'
>>> text[3]
u'r'
>>> text[4]
u'\ud800'
>>> text[5]
u'\udf30'
>>> text[6]
u'k'
>>>

so text[4] (which should be \U00010330)
was split into 2 16bit values (text[4] and text[5]).
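
the two values u'\ud800' and u'\udf30' look like a utf16 surrogate pair for 0x10330. a small sketch of that arithmetic, assuming the standard utf16 rule (the function name to_surrogate_pair is hypothetical, it is not a python builtin):

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    # Subtract 0x10000 to get a 20-bit offset, then split it into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10330)
print("%#x %#x" % (high, low))  # 0xd800 0xdf30
```

this matches the two items the interpreter shows, so the string seems to be stored as 16bit units and len() counts those units rather than characters.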

i don't understand.
if the representation of 'text' is correct, why is the length wrong?

btw. i understand that it's a very exotic character, but i tried for
example kwrite and gedit, and neither of them was able to display the
symbol, but both successfully identified it as ONE unknown symbol.

thanks,
gabor