[Python-ideas] Add "has_surrogates" flags to string object

Tue Oct 8 22:37:54 CEST 2013

On Tue, Oct 8, 2013 at 7:20 AM, Steven D'Aprano <steve at pearwood.info> wrote:

> Given:
>
> c = '\N{LINEAR B SYLLABLE B038 E}'  # \U00010001
> c.encode('utf-8')
> => b'\xf0\x90\x80\x81'
>
> and:
>
> c.encode('utf-16BE')  # encodes as a surrogate pair
> => b'\xd8\x00\xdc\x01'
>
> then those same surrogates, taken as codepoints, should be encodable as
> UTF-8:
>
> '\ud800\udc01'.encode('utf-8')
> => b'\xf0\x90\x80\x81'
>
>
> I'd actually be disappointed if that were the case; I think that would
> be a poor design. But if that's what the Unicode standard demands,
> Python ought to support it.
>

The Unicode FAQ is explicit that this is wrong: "The definition of UTF-8 requires
that supplementary characters (those using surrogate pairs in UTF-16) be
encoded with a single four byte sequence."
http://www.unicode.org/faq/utf_bom.html#utf8-4
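
For what it's worth, Python 3 already appears to behave the way the FAQ
describes. A minimal check, assuming a Python 3 interpreter (the variable
names are just for illustration):

```python
# The supplementary character itself encodes to one four-byte sequence.
c = '\U00010001'  # LINEAR B SYLLABLE B038 E
four_bytes = c.encode('utf-8')

# The same surrogates written as two separate code points are refused
# by the strict UTF-8 encoder instead of being silently combined.
pair = '\ud800\udc01'
try:
    pair.encode('utf-8')
    rejected = False
except UnicodeEncodeError:
    rejected = True

print(four_bytes, rejected)
```

So the surrogate pair is never turned into b'\xf0\x90\x80\x81' behind your
back; you get an error instead.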

It goes on to say that encoding surrogates this way is nonetheless a
widespread practice in older software. It might therefore be reasonable to
accept these mis-encoded characters when *decoding*, but they should never
be generated when *encoding*. I'd prefer not to have that on by default,
given the history of overlong UTF-8 bugs (e.g., see
http://blogs.msdn.com/b/michael_howard/archive/2008/08/22/overlong-utf-8-escapes-bite.aspx).
Essentially, if different decoders follow different rules, you can
sometimes sneak stuff through the permissive ones.
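
To make the strict/permissive distinction concrete, here is a small sketch,
again assuming Python 3. The helper strict_rejects is a made-up name for
illustration; 'surrogatepass' is the standard error handler that opts in to
the permissive behaviour:

```python
# Two byte strings that a strict UTF-8 decoder must refuse:
surrogate_bytes = b'\xed\xa0\x80\xed\xb0\x81'  # U+D800, U+DC01 encoded individually
overlong_slash = b'\xc0\xaf'                   # overlong (two-byte) form of '/'

def strict_rejects(data):
    """Return True if the default (strict) UTF-8 decoder refuses data."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return True
    return False

# Opting in explicitly lets the surrogate bytes through, as two separate
# surrogate code points rather than one combined character.
permissive = surrogate_bytes.decode('utf-8', 'surrogatepass')

print(strict_rejects(surrogate_bytes), strict_rejects(overlong_slash))
print(ascii(permissive))
```

The point is that the permissive path has to be requested per call, which is
the "not on by default" arrangement I'd want.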

Notwithstanding that, there is a different Unicode encoding, CESU-8, which
does the opposite: it always encodes characters requiring surrogate pairs
as 6 bytes, consisting of two 3-byte UTF-8-style encodings of the
individual surrogate code points. Python doesn't support this, and the
request to support it was rejected: http://bugs.python.org/issue12742
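
If someone really needs CESU-8, it can be approximated in user code. The
following is only a rough sketch under my own assumptions, not anything from
the standard library: cesu8_encode and cesu8_decode are hypothetical names,
and the decoder is more lenient than a real CESU-8 decoder because it also
lets ordinary four-byte UTF-8 sequences through. The decoder relies on
'surrogatepass' working with the UTF-16 codecs, which needs Python 3.4 or
later.

```python
def cesu8_encode(text):
    """Encode text CESU-8 style: supplementary characters become two
    3-byte sequences, one per UTF-16 surrogate (hypothetical helper)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair.
            cp -= 0x10000
            hi = 0xD800 + (cp >> 10)
            lo = 0xDC00 + (cp & 0x3FF)
            # Encode each surrogate individually, UTF-8 style.
            out += (chr(hi) + chr(lo)).encode('utf-8', 'surrogatepass')
        else:
            # BMP characters are encoded as in plain UTF-8; a lone
            # surrogate in the input raises UnicodeEncodeError here.
            out += ch.encode('utf-8')
    return bytes(out)

def cesu8_decode(data):
    """Decode CESU-8 style bytes (hypothetical helper, lenient sketch)."""
    # Step 1: let the individually encoded surrogates through.
    with_surrogates = data.decode('utf-8', 'surrogatepass')
    # Step 2: round-trip through UTF-16 so each pair is recombined.
    utf16 = with_surrogates.encode('utf-16-le', 'surrogatepass')
    return utf16.decode('utf-16-le')

encoded = cesu8_encode('a\U00010001')
print(encoded, cesu8_decode(encoded) == 'a\U00010001')
```

Keeping this outside the built-in utf-8 codec means the standard encoder and
decoder stay strict.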

--- Bruce