[I18n-sig] Unicode surrogates: just say no!

Fredrik Lundh fredrik@pythonware.com
Tue, 26 Jun 2001 20:27:50 +0200


guido wrote:

> - with 16-bit (narrow) Py_UNICODE:
> 
>   - unichr(i) for 0 <= i <= 0xffff always returns a size-one string
>     where ord(u[0]) == i
> 
>   - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u
>     and \U) generates a surrogate pair, where u[0] is the high
>     surrogate value and u[1] the low surrogate value
> 
>   - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U)
>     raises an exception at Python-to-bytecode compile-time

or in other words:

>>> unichr.__doc__
'unichr(i) -> Unicode character\n\nReturn a Unicode string of one character with ordinal i; 0 <= i < 1114112.'
>>> unichr(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: unichr() arg not in range(1114111)
>>> unichr(0)
u'\x00'
>>> unichr(1)
u'\x01'
>>> unichr(256)
u'\u0100'
>>> unichr(55296)
u'\ud800'
>>> unichr(65535)
u'\uffff'
>>> unichr(65536)
u'\ud800\udc00'
>>> unichr(1114111)
u'\udbff\udfff'
>>> unichr(1114112)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: unichr() arg not in range(1114111)

>>> "\U00000000"
'\\U00000000'
>>> "\U00000100"
'\\U00000100'
>>> u"\U00000000"
u'\x00'
>>> u"\U00000100"
u'\u0100'
>>> u"\U0000d800"
u'\ud800'
>>> u"\U0000ffff"
u'\uffff'
>>> u"\U00010000"
u'\ud800\udc00'
>>> u"\U0010ffff"
u'\udbff\udfff'
>>> u"\U00110000"
UnicodeError: Unicode-Escape decoding error: illegal Unicode character

(\U behaviour as in 2.1, unichr as in my development version of 2.2)
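
for reference, the surrogate pairs in the transcripts above follow from the
usual UTF-16 arithmetic.  a minimal sketch in plain Python, not the actual C
implementation (the function name surrogate_pair is made up for illustration,
and the code is written for a current Python):

```python
def surrogate_pair(i):
    """Split a code point above 0xffff into (high, low) surrogate values."""
    if not 0x10000 <= i <= 0x10ffff:
        raise ValueError("code point not in range(0x10000, 0x110000)")
    offset = i - 0x10000              # 20-bit offset into the supplementary planes
    high = 0xd800 + (offset >> 10)    # top 10 bits select the high surrogate
    low = 0xdc00 + (offset & 0x3ff)   # bottom 10 bits select the low surrogate
    return high, low
```

with this, surrogate_pair(0x10000) gives (0xd800, 0xdc00) and
surrogate_pair(0x10ffff) gives (0xdbff, 0xdfff), matching the output above.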

note that unichr raises a ValueError for out-of-range arguments, while
the \U escape raises a UnicodeError.  should unichr be changed to match?
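
one thing that may be worth keeping in mind: UnicodeError is a subclass of
ValueError, so code that catches ValueError should keep working either way.
a small sketch, where checked_ordinal is a hypothetical stand-in for a unichr
that raises UnicodeError (written for a current Python, not 2.1/2.2):

```python
# UnicodeError derives from ValueError, so a ValueError handler covers both
assert issubclass(UnicodeError, ValueError)

def checked_ordinal(i):
    # hypothetical helper: range check only, raising UnicodeError on failure
    if not 0 <= i <= 0x10ffff:
        raise UnicodeError("ordinal not in range(0x110000)")
    return i

try:
    checked_ordinal(0x110000)
except ValueError as exc:       # also catches UnicodeError
    caught = type(exc).__name__
```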

Cheers /F