Python's handling of unicode surrogates

Tue Apr 24 07:43:16 EDT 2007

Ross Ridge <rridge at caffeine.csclub.uwaterloo.ca> writes:

 > The Unicode standard doesn't require that you support surrogates,
 > or any other kind of character, so no you wouldn't be lying.

+1 on Ross Ridge's contributions to this thread.

If Unicode is processed using the UTF-8 or UTF-32 encoding forms then
there are no surrogates.  They are only present in UTF-16.  CESU-8,
which encodes each surrogate of a pair as its own three-byte sequence,
is strongly discouraged.
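
To make that concrete, here is a small sketch in present-day Python 3
(not the 2007 interpreter under discussion).  The choice of U+1D11E and
the helper name is_surrogate are mine, purely for illustration.  It
encodes one character from outside the BMP in all three encoding forms
and checks which code units fall in the surrogate range.

```python
import struct

ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point outside the BMP

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

# Split the UTF-16 and UTF-32 byte strings into their code units.
utf16_units = struct.unpack(">%dH" % (len(utf16) // 2), utf16)
utf32_units = struct.unpack(">%dI" % (len(utf32) // 4), utf32)


def is_surrogate(unit):
    """True if the value lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


print([hex(b) for b in utf8])         # four bytes, none is a surrogate
print([hex(u) for u in utf16_units])  # a high and a low surrogate
print([hex(u) for u in utf32_units])  # one code unit, the code point itself
```

Only the UTF-16 form contains values for which is_surrogate() is true.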

A Unicode 16-bit string is allowed to be ill-formed as UTF-16.  The
example the standard gives is one string that ends with a high
surrogate code unit and another that starts with a low surrogate code
unit.  Neither is well-formed UTF-16 on its own, but the result of
concatenating them is a well-formed UTF-16 string.
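
The concatenation case can be sketched as follows, modelling a Unicode
16-bit string as a plain list of integers.  The checker
is_well_formed_utf16 and the sample values are hypothetical, written
for this post, not taken from the standard or from Python itself.

```python
def is_well_formed_utf16(units):
    """Return True if a sequence of 16-bit code units is well-formed UTF-16."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True


left = [0x0041, 0xD834]   # ends with a high surrogate
right = [0xDD1E, 0x0042]  # starts with a low surrogate

print(is_well_formed_utf16(left))          # False
print(is_well_formed_utf16(right))         # False
print(is_well_formed_utf16(left + right))  # True
```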

The above refers to the Unicode standard.  In Python with narrow
Py_UNICODE a unicode string is a sequence of 16-bit Unicode code
units.  It is up to the programmer whether to handle surrogate code
units specially.  Operations based on concatenation will conform to
Unicode, whether or not there are surrogates in the strings.
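
For a programmer who does want to handle surrogates specially, one
possible approach is sketched below.  It runs on current Python 3,
which no longer has narrow builds, so the narrow string is emulated
with a list of 16-bit integers.  The generator name code_points is my
own invention and not a Python API.

```python
def code_points(units):
    """Yield code points from 16-bit code units, pairing surrogates.

    Unpaired surrogates are passed through unchanged, so ill-formed
    input is tolerated rather than rejected.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Combine a high/low surrogate pair into one code point.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


# Emulated narrow string: 'A', U+1D11E as a surrogate pair, 'B'.
narrow = [0x0041, 0xD834, 0xDD1E, 0x0042]

print(len(narrow))                            # 4 code units
print([hex(c) for c in code_points(narrow)])  # 3 code points
```

Code that ignores surrogates sees four items here, while code that
pairs them sees three characters.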
-- 
Pete Forman                -./\.-  Disclaimer: This post is originated
WesternGeco                  -./\.-   by myself and does not represent
pete.forman at westerngeco.com    -./\.-   the opinion of Schlumberger or
http://petef.port5.com           -./\.-   WesternGeco.