break unichr instead of fix ord?
Vlastimil Brom
vlastimil.brom at gmail.com
Wed Aug 26 04:05:17 EDT 2009
2009/8/25 <rurpy at yahoo.com>:
> In Python 2.5 on Windows I could do [*1]:
>
> # Create a unicode character outside of the BMP.
> >>> a = u'\U00010040'
>
> # On Windows it is represented as a surrogate pair.
> >>> len(a)
> 2
> >>> a[0],a[1]
> (u'\ud800', u'\udc40')
>
> # Create the same character with the unichr() function.
> >>> a = unichr (65600)
> >>> a[0],a[1]
> (u'\ud800', u'\udc40')
>
> # Although the unichr() function works fine, its
> # inverse, ord(), doesn't.
> >>> ord (a)
> TypeError: ord() expected a character, but string of length 2 found
>
> On Python 2.6, unichr() was "fixed" (using the word
> loosely) so that it too now fails with characters outside
> the BMP.
>
> >>> a = unichr (65600)
> ValueError: unichr() arg not in range(0x10000) (narrow Python build)
>
> Why was this done rather than changing ord() to accept a
> surrogate pair?
>
> Doesn't this effectively make unichr() and ord() useless
> on Windows for all but a subset of Unicode characters?
>
Hi,
I'm not sure of the exact reasons for this behaviour on narrow builds
either (maybe to keep the input and output of both functions
consistently at exactly one character?).
However, if I need these functions for the higher Unicode planes, the
following rather hackish replacements seem to work. I presume there
might be smarter ways of dealing with this, but anyway...
hth,
vbr
#### not (systematically) tested #####################################
import sys

def wide_ord(char):
    # Like ord(), but also accepts a surrogate pair (narrow builds).
    try:
        return ord(char)
    except TypeError:
        if (len(char) == 2 and 0xD800 <= ord(char[0]) <= 0xDBFF and
                0xDC00 <= ord(char[1]) <= 0xDFFF):
            return ((ord(char[0]) - 0xD800) * 0x400 +
                    (ord(char[1]) - 0xDC00) + 0x10000)
        else:
            raise TypeError("invalid character input")

def wide_unichr(i):
    # Like unichr(), but builds a surrogate pair above sys.maxunicode.
    if i <= sys.maxunicode:
        return unichr(i)
    else:
        # "%08x" avoids the trailing "L" that hex() appends to long ints.
        return ("\\U%08x" % i).decode("unicode-escape")
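
The surrogate arithmetic itself can be checked on plain integers,
independent of whether the build is narrow or wide. A minimal sketch
(to_surrogates and from_surrogates are names made up for illustration,
not built-ins; also not systematically tested):

```python
# Build-independent sketch: surrogate-pair arithmetic on plain integers.
# to_surrogates / from_surrogates are illustrative names, not built-ins.

def to_surrogates(cp):
    """Split a code point above the BMP into (high, low) surrogate values."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary planes")
    offset = cp - 0x10000
    # Upper 10 bits go into the high surrogate, lower 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000

# 65600 == 0x10040, the character from the original question.
pair = to_surrogates(65600)
assert pair == (0xD800, 0xDC40)
assert from_surrogates(*pair) == 65600
```

The pair (0xD800, 0xDC40) matches the u'\ud800', u'\udc40' shown in the
quoted session, and from_surrogates uses the same formula as the
fallback branch of wide_ord.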