[I18n-sig] Unicode surrogates: just say no!
Guido van Rossum
guido@digicool.com
Tue, 26 Jun 2001 15:39:16 -0400
> guido wrote:
>
> > - with 16-bit (narrow) Py_UNICODE:
> >
> > - unichr(i) for 0 <= i <= 0xffff always returns a size-one string
> > where ord(u[0]) == i
> >
> > - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u
> > and \U) generates a surrogate pair, where u[0] is the high
> > surrogate value and u[1] the low surrogate value
> >
> > - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U)
> > raises an exception at Python-to-bytecode compile-time
>
> or in other words:
>
> >>> unichr.__doc__
> 'unichr(i) -> Unicode character\n\nReturn a Unicode string of one character with ordinal i; 0 <= i < 1114112.'
I would write 0 <= i <= 0x10ffff, but otherwise, yes. Check it in
already!
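
For illustration only, the narrow-build rules quoted above can be sketched in present-day Python. This is not the C implementation: the name narrow_unichr is hypothetical, and it returns a list of 16-bit code units instead of a real Py_UNICODE buffer. It also raises at call time, whereas the out-of-range \U case quoted above would fail at compile time.

```python
def narrow_unichr(i):
    """Return the 16-bit code units a narrow build would hold for ordinal i.

    Hypothetical helper for illustration; assumes the rules quoted above.
    """
    if not 0 <= i <= 0x10FFFF:
        # Out-of-range ordinals are rejected.
        raise ValueError("ordinal not in range 0 <= i <= 0x10ffff")
    if i <= 0xFFFF:
        # Fits in one code unit: a size-one string where ord(u[0]) == i.
        return [i]
    # Otherwise split the 20-bit offset into a surrogate pair.
    offset = i - 0x10000
    high = 0xD800 + (offset >> 10)    # u[0]: high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # u[1]: low surrogate
    return [high, low]
```

Under these assumptions, narrow_unichr(0x10000) gives [0xD800, 0xDC00] and narrow_unichr(0x10FFFF) gives [0xDBFF, 0xDFFF].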
> note that unichr raises a ValueError, not a UnicodeError. should this
> be changed?
I think not. The input value is wrong, and that's a ValueError. There
are lots of ValueErrors in the Unicode implementation. There are lots
of UnicodeErrors too; the distinction isn't always clear. MAL?
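
A small sketch of how the distinction looks from the caller's side, assuming a present-day Python 3 interpreter where chr() has taken over the role of unichr(). The helper name describe_bad_ordinal is hypothetical. Since UnicodeError is a subclass of ValueError, the more specific handler has to come first.

```python
def describe_bad_ordinal(i):
    """Report which exception class, if any, chr(i) raises."""
    try:
        chr(i)
    except UnicodeError:
        # Checked first because UnicodeError derives from ValueError.
        return "UnicodeError"
    except ValueError:
        return "ValueError"
    return "ok"
```

Code that catches ValueError therefore handles both kinds of failure, which is one reason the blurry boundary rarely bites in practice.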
--Guido van Rossum (home page: http://www.python.org/~guido/)