Re: unicode data - accessing codepoints > FFFF on narrow python builds

Wed Apr 18 15:56:40 EDT 2007

Hi, thanks for the answer,

> From: Gabriel Genellina <gagsl-py2 at yahoo.com.ar>
> Subj: Re: unicode data - accessing codepoints > FFFF on narrow python builds
> Date: 18.4.2007 21:33:11
> ----------------------------------------
> 
> py> x=u"\N{GOTHIC LETTER AHSA}"
> py> ord(x)
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> TypeError: ord() expected a character, but string of length 2 found
> py> unicodedata.name(x)
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> TypeError: need a single Unicode character as parameter
> py> len(x)
> 2
> py> list(x)
> [u'\ud800', u'\udf30']

> 
> That looks like UTF-16 (?) but seen as two characters instead of one.
> Probably in a 32bits build Python should refuse to use such character (and  
> limit Unicode support to the basic plane?) (or not?) (if not, what's the  
> point of sys.maxunicode?) (enough parenthesis for now).
> 

> -- 
> Gabriel Genellina
> 

Yes, this is a UTF-16 surrogate pair, which is, as far as I know, the usual way characters outside the basic plane are handled on narrow Python builds. There are some problems with it, but most of the things I need to do with non-basic-plane characters work this way (GUI display, saving text as UTF-8), so I wouldn't be happy if this support were removed.
The problem is access to unicodedata, which requires "a string of length 1"; I thought it could also accept the code point number, but that doesn't seem to be possible.
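For reference, the code point number itself can at least be recovered from the pair by hand. The following is only a minimal sketch based on the standard UTF-16 surrogate arithmetic; the helper names surrogate_pair_to_codepoint and codepoint_of are made up for illustration and are not part of Python or unicodedata. It does not make unicodedata accept the pair on a narrow build, it only yields the number.

```python
# Sketch only: both helper names are hypothetical, not a library API.

def surrogate_pair_to_codepoint(high, low):
    """Combine two UTF-16 surrogate code units (given as ints) into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def codepoint_of(s):
    """Return the code point of a 1-unit string or of a 2-unit surrogate pair."""
    if len(s) == 1:
        return ord(s)
    if len(s) == 2:
        return surrogate_pair_to_codepoint(ord(s[0]), ord(s[1]))
    raise ValueError("expected one character or one surrogate pair")


# The two code units from the list(x) output quoted above.
print(hex(codepoint_of(u"\ud800\udf30")))
```

For the GOTHIC LETTER AHSA pair shown above this should give 0x10330, which is larger than the sys.maxunicode of a narrow build.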
Thanks again.

vbr - Vlastimil Brom