[I18n-sig] How does Python Unicode treat surrogates?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Sat, 23 Jun 2001 14:20:38 +0200


> About surrogate support in Python: the UTF-8 codec has full
> surrogate support for encodings and decoding

I think there are a number of bugs lying around here. For example,
shouldn't

>>> u" \ud800 ".encode("utf-8")
' \xa0\x80 '

give an error, since this is a lone low surrogate word?

Likewise, but somewhat more troubling, surrogates that straddle write
invocations are not processed properly.

>>> s=StringIO.StringIO()
>>> _,_,r,w=codecs.lookup("utf-8")
>>> f=w(s)
>>> f.write(u"\ud800")
>>> f.write(u"\udc00")
>>> f.flush()
>>> s.getvalue()
'\xa0\x80\xed\xb0\x80'

whereas the correct answer would have been

>>> u"\ud800\udc00".encode("utf-8")
'\xf0\x90\x80\x80'

Regards,
Martin