Encoding of surrogate code points to UTF-8

Tue Oct 8 18:20:58 EDT 2013

On Tue, 08 Oct 2013 18:00:58 +0100, MRAB wrote:

> The only time you should get a surrogate pair in a Unicode string is in
> a narrow build, which doesn't exist in Python 3.3 and later.

Incorrect.

py> sys.version
'3.3.0rc3 (default, Sep 27 2012, 18:44:58) \n[GCC 4.1.2 20080704 (Red Hat 
4.1.2-52)]'
py> s = '\ud800\udc01'
py> print(len(s))
2
py> import unicodedata as ud
py> for c in s:
...     print(ud.category(c))
...
Cs
Cs

s is a string containing two code points making up a surrogate pair.

It is very frustrating that the Unicode FAQs don't always clearly 
distinguish between when they are talking about bytes and when they are 
talking about code points. This area about surrogates is one of places 
where they conflate the two.

-- 
Steven