Re: unicode data - accessing codepoints > FFFF on narrow python builds
vbr at email.cz
Wed Apr 18 15:56:40 EDT 2007
Hi, thanks for the answer,
> From: Gabriel Genellina <gagsl-py2 at yahoo.com.ar>
> Subj: Re: unicode data - accessing codepoints > FFFF on narrow python builds
> Date: 18.4.2007 21:33:11
> ----------------------------------------
>
> py> x=u"\N{GOTHIC LETTER AHSA}"
> py> ord(x)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> TypeError: ord() expected a character, but string of length 2 found
> py> unicodedata.name(x)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> TypeError: need a single Unicode character as parameter
> py> len(x)
> 2
> py> list(x)
> [u'\ud800', u'\udf30']
>
> That looks like UTF-16 (?) but seen as two characters instead of one.
> Probably in a 32bits build Python should refuse to use such character (and
> limit Unicode support to the basic plane?) (or not?) (if not, what's the
> point of sys.maxunicode?) (enough parenthesis for now).
>
> --
> Gabriel Genellina
>
Yes, this is a UTF-16 surrogate pair, which is, as far as I know, the usual way characters outside the basic plane are handled on narrow Python builds. There are some problems with it, but most of the things I need to do with non-basic-plane characters work this way (GUI display, saving text as UTF-8), so I wouldn't be happy if this support were removed.
The problem is access to unicodedata, which requires "a string of length 1". I thought it might also accept the code point number, but that doesn't seem to be possible.
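For reference, the code point number itself can be recovered from the two halves of the pair with the standard UTF-16 arithmetic. A minimal sketch follows; the function names are my own and not part of any library, and the functions work on plain integers, so they do not depend on how the build stores strings:

```python
def surrogates_to_codepoint(high, low):
    """Combine a UTF-16 high/low surrogate pair (as integers) into a code point."""
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("not a high surrogate: %#06x" % high)
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a low surrogate: %#06x" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def codepoint_to_surrogates(cp):
    """Split a code point above FFFF into its (high, low) surrogate pair."""
    if not (0x10000 <= cp <= 0x10FFFF):
        raise ValueError("code point outside the supplementary planes: %#x" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# GOTHIC LETTER AHSA, shown above as u'\ud800', u'\udf30'
gothic_ahsa = surrogates_to_codepoint(0xD800, 0xDF30)
```

On a narrow build this would be called as surrogates_to_codepoint(ord(x[0]), ord(x[1])). It only yields the number; unicodedata.name() still wants a one-character string, so the lookup problem itself remains.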
Thanks again.
vbr - Vlastimil Brom