python-unicode doesn't support >65535 symbols?

Thu Nov 27 06:25:07 EST 2003

hi,

today i made some tests...

i tested some unicode symbols that are above the 16bit limit
(gothic: http://www.unicode.org/charts/PDF/U10330.pdf).

i played around with iconv and so on,
so in the end i created a utf8 encoded text file
with the text "Marrakesh",
where the second 'a' was replaced with
GOTHIC_LETTER_AHSA (unicode-value:0x10330).

(i simply wrote the text file "Marrakesh", used iconv to convert it to
utf32big-endian, and replaced the character in hexedit, then converted
with iconv back to utf8).
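
the same bytes can probably be built without iconv and hexedit. this is only a sketch: it assumes a python 3 interpreter (not the python used in the session below), and build_sample() is a name made up for illustration.

```python
# assumption: Python 3; build_sample() is a hypothetical helper, not a library call
def build_sample(word="Marrakesh", index=4, code_point=0x10330):
    # UTF-32 big-endian stores every character in exactly 4 bytes (no BOM)
    raw = bytearray(word.encode("utf-32-be"))
    # overwrite the 4 bytes of the character at position `index`
    raw[4 * index:4 * index + 4] = code_point.to_bytes(4, "big")
    # decode the patched UTF-32 data and re-encode it as UTF-8
    return bytes(raw).decode("utf-32-be").encode("utf-8")

sample = build_sample()
print(sample)  # b'Marr\xf0\x90\x8c\xb0kesh'
```

writing `sample` to a file opened in binary mode should give the same utf8.txt as the iconv route.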

now i started python:

>>> data = open("utf8.txt").read()
>>> data
'Marr\xf0\x90\x8c\xb0kesh'
>>> text = data.decode("utf8")
>>> text
u'Marr\U00010330kesh'

so far it seemed ok.
then i did:

>>> len(text)
10

this is wrong. the length should be 9.
and why?

>>> text[0]
u'M'
>>> text[1]
u'a'
>>> text[2]
u'r'
>>> text[3]
u'r'
>>> text[4]
u'\ud800'
>>> text[5]
u'\udf30'
>>> text[6]
u'k'
>>>

so text[4] (which should be \U00010330)
was split into 2 16bit values (text[4] and text[5]).

i don't understand.
if the representation of 'text' is correct, why is the length wrong?
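
one possible explanation: the two values at text[4] and text[5] match what the UTF-16 surrogate pair arithmetic gives for 0x10330. that would suggest this python build stores unicode strings as 16bit units and that len() counts those units, not symbols. a sketch of the arithmetic, where to_surrogates() is a made-up name and the input is assumed to be above 0xFFFF:

```python
# assumption: to_surrogates() is a hypothetical helper for illustration
def to_surrogates(code_point):
    # only code points above 0xFFFF are split into a pair
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

high, low = to_surrogates(0x10330)
print(hex(high), hex(low))  # 0xd800 0xdf30
```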

btw. i understand that it's a very exotic character, but i tried for
example kwrite and gedit, and neither of them was able to display the
symbol, but both successfully identified it as ONE unknown symbol.
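
to get the count i expected, something like the following might work. it is a rough sketch that assumes the string is available as a list of 16bit values and that a high surrogate followed by a low surrogate should count as one symbol; count_symbols() is a made-up name.

```python
# assumption: count_symbols() is a hypothetical helper; `units` holds 16bit values
def count_symbols(units):
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # a high surrogate followed by a low surrogate is a single symbol
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count

# the 10 values seen in the session above
units = [ord(c) for c in "Marr"] + [0xD800, 0xDF30] + [ord(c) for c in "kesh"]
print(len(units), count_symbols(units))  # 10 9
```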

thanks,
gabor