[I18n-sig] Unicode surrogates: just say no!
Fredrik Lundh
fredrik@pythonware.com
Tue, 26 Jun 2001 20:27:50 +0200
guido wrote:
> - with 16-bit (narrow) Py_UNICODE:
>
>   - unichr(i) for 0 <= i <= 0xffff always returns a size-one string
>     where ord(u[0]) == i
>
>   - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u
>     and \U) generates a surrogate pair, where u[0] is the high
>     surrogate value and u[1] the low surrogate value
>
>   - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U)
>     raises an exception at Python-to-bytecode compile-time
or in other words:
>>> unichr.__doc__
'unichr(i) -> Unicode character\n\nReturn a Unicode string of one character with ordinal i; 0 <= i < 1114112.'
>>> unichr(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: unichr() arg not in range(1114111)
>>> unichr(0)
u'\x00'
>>> unichr(1)
u'\x01'
>>> unichr(256)
u'\u0100'
>>> unichr(55296)
u'\ud800'
>>> unichr(65535)
u'\uffff'
>>> unichr(65536)
u'\ud800\udc00'
>>> unichr(1114111)
u'\udbff\udfff'
>>> unichr(1114112)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: unichr() arg not in range(1114111)
>>> "\U00000000"
'\\U00000000'
>>> "\U00000100"
'\\U00000100'
>>> u"\U00000100"
u'\u0100'
>>> u"\U00000000"
u'\x00'
>>> u"\U0000d800"
u'\ud800'
>>> u"\U0000ffff"
u'\uffff'
>>> u"\U00010000"
u'\ud800\udc00'
>>> u"\U0010ffff"
u'\udbff\udfff'
>>> u"\U00110000"
UnicodeError: Unicode-Escape decoding error: illegal Unicode character
(\U behaviour as in 2.1, unichr as in my development version of 2.2)
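for reference, the surrogate pairs in the session above follow from the usual UTF-16 arithmetic. a minimal sketch, assuming Python 3 syntax (not the 2.x interpreter shown above); to_surrogates is a hypothetical helper name, not part of any Python API:

```python
# Sketch only: to_surrogates is a hypothetical helper, written for Python 3.
def to_surrogates(i):
    """Return the 16-bit code units for code point i as a tuple of ints."""
    if not 0 <= i <= 0x10FFFF:
        raise ValueError("code point out of range: %r" % (i,))
    if i <= 0xFFFF:
        return (i,)              # fits in a single 16-bit unit
    v = i - 0x10000              # 20-bit offset above the BMP
    high = 0xD800 + (v >> 10)    # top 10 bits select the high surrogate
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits select the low surrogate
    return (high, low)

print([hex(u) for u in to_surrogates(0x10000)])
print([hex(u) for u in to_surrogates(0x10FFFF)])
```

the two calls correspond to the unichr(65536) and unichr(1114111) lines in the session.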
note that unichr raises a ValueError, not a UnicodeError. should this
be changed?
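some context for that question: UnicodeError is itself a subclass of ValueError, so code that catches ValueError handles both cases. a small sketch, assuming a current Python 3 interpreter, where chr() stands in for unichr() and the unicode_escape codec stands in for the \U literal; error_name and decode_escape are hypothetical helpers:

```python
# Sketch only: Python 3, with chr() playing the role of unichr().
def error_name(func, arg):
    """Call func(arg); return the exception class name, or None on success."""
    try:
        func(arg)
    except ValueError as exc:    # also catches UnicodeError and its subclasses
        return type(exc).__name__
    return None

def decode_escape(raw):
    """Decode a bytes string containing backslash escapes such as \\UXXXXXXXX."""
    return raw.decode("unicode_escape")

print(issubclass(UnicodeError, ValueError))
print(error_name(chr, 0x110000))
print(error_name(decode_escape, b"\\U00110000"))
```

whichever exception is chosen, a single "except ValueError" clause covers both paths.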
Cheers /F