[I18n-sig] How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Sat, 23 Jun 2001 22:19:09 +0200


"Martin v. Loewis" wrote:
> 
> > About surrogate support in Python: the UTF-8 codec has full
> > surrogate support for encoding and decoding
> 
> I think there are a number of bugs lying around here. For example,
> shouldn't
> 
> >>> u" \ud800 ".encode("utf-8")
> ' \xa0\x80 '
> 
> give an error, since this is a lone high surrogate word?

Yes.
 
> Likewise, but somewhat more troubling, surrogates that straddle write
> invocations are not processed properly.
> 
> >>> s=StringIO.StringIO()
> >>> _,_,r,w=codecs.lookup("utf-8")
> >>> f=w(s)
> >>> f.write(u"\ud800")
> >>> f.write(u"\udc00")
> >>> f.flush()
> >>> s.getvalue()
> '\xa0\x80\xed\xb0\x80'
> 
> whereas the correct answer would have been
> 
> >>> u"\ud800\udc00".encode("utf-8")
> '\xf0\x90\x80\x80'

This is a special case of the above (since the encoder will
see truncated surrogates and should raise an exception
for these).
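
For illustration, here is a minimal sketch of one way a stream writer
could cope with a pair that straddles two write() calls. It assumes a
current Python 3, where the UTF-8 codec already rejects lone surrogates
with UnicodeEncodeError. The names SurrogateBufferingWriter and
combine_pairs are hypothetical and not part of the codecs module.

```python
import io


def combine_pairs(text):
    """Join each high+low surrogate pair into one code point (hypothetical helper)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if "\ud800" <= ch <= "\udbff" and nxt and "\udc00" <= nxt <= "\udfff":
            cp = 0x10000 + ((ord(ch) - 0xD800) << 10) + (ord(nxt) - 0xDC00)
            out.append(chr(cp))
            i += 2
        else:
            out.append(ch)  # lone surrogates stay and fail later in encode()
            i += 1
    return "".join(out)


class SurrogateBufferingWriter:
    """Hypothetical wrapper: holds back a trailing high surrogate until the next write."""

    def __init__(self, stream):
        self.stream = stream
        self.pending = ""

    def write(self, text):
        text = self.pending + text
        self.pending = ""
        if text and "\ud800" <= text[-1] <= "\udbff":
            # The matching low surrogate may arrive with the next write().
            self.pending = text[-1]
            text = text[:-1]
        # encode() raises UnicodeEncodeError for any remaining lone surrogate.
        self.stream.write(combine_pairs(text).encode("utf-8"))

    def flush(self):
        if self.pending:
            pending, self.pending = self.pending, ""
            pending.encode("utf-8")  # truncated pair: raises UnicodeEncodeError
        self.stream.flush()


s = io.BytesIO()
f = SurrogateBufferingWriter(s)
f.write("\ud800")
f.write("\udc00")
f.flush()
print(s.getvalue())
```

Buffering is only one possible design. The stricter behaviour described
above would drop the pending buffer and let the encode call fail as soon
as a write ends in a lone surrogate.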

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/