Newbie question about text encoding

Sat Mar 7 10:40:35 EST 2015

Marko Rauhamaa wrote:

> That said, UTF-8 does suffer badly from its not being
> a bijective mapping.

Can you explain?

As far as I am aware, every code point has one and only one valid UTF-8
encoding, and every UTF-8 encoding has one and only one valid code point.
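
For what it's worth, that claim can be checked by brute force. This is only
a sketch, and it assumes Python 3, where chr() covers the whole code point
range and the strict 'utf-8' codec refuses surrogates and overlong forms.
The helper names utf8_round_trips and is_valid_utf8 are made up for
illustration.

```python
def utf8_round_trips():
    """Return True if every Unicode scalar value survives encode -> decode."""
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            # Surrogates are not scalar values; strict UTF-8 excludes them.
            continue
        ch = chr(cp)
        if ch.encode('utf-8').decode('utf-8') != ch:
            return False
    return True


def is_valid_utf8(data):
    """Return True if the bytes decode under the strict UTF-8 codec."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True
```

The first function shows that encoding never loses or merges code points.
The second lets you probe the other direction: is_valid_utf8(b'\xc0\x80')
is False because that is an overlong form of U+0000, whose only valid
encoding is b'\x00'.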

There are encodings that resemble UTF-8 but are *not* valid UTF-8, such as
CESU-8, which is sometimes mislabelled as UTF-8 (Oracle, I'm looking at
you.) It violates the rule that valid UTF-8 never encodes surrogate code
points, and so it ends up using six bytes where UTF-8 uses four.

E.g. code points in the supplementary planes (above U+FFFF) should be
encoded to four bytes using UTF-8:

py> u'\U0010FF01'.encode('utf-8')  # U+10FF01
'\xf4\x8f\xbc\x81'

But in CESU-8, the code point is first converted to a UTF-16 surrogate
pair:

py> u'\U0010FF01'.encode('utf-16be')
'\xdb\xff\xdf\x01'

then each surrogate in the pair is treated as a 16-bit code unit and
individually encoded to three bytes using UTF-8:

py> u'\udbff'.encode('utf-8')
'\xed\xaf\xbf'
py> u'\udf01'.encode('utf-8')
'\xed\xbc\x81'

giving six bytes in total:

'\xed\xaf\xbf\xed\xbc\x81'

This is not UTF-8! But some software mislabels it as UTF-8.
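
The steps above can be put together in a few lines. This is a rough
sketch, not a reference implementation: it assumes Python 3 (the sessions
above are Python 2), where lone surrogates can only be encoded with the
'surrogatepass' error handler, and cesu8_encode is a name I made up for
illustration.

```python
def cesu8_encode(text):
    """Encode text the way CESU-8 does (illustrative sketch only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split the code point into a UTF-16 surrogate pair...
            cp -= 0x10000
            pair = (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))
            # ...then encode each surrogate separately as three bytes.
            for unit in pair:
                out += chr(unit).encode('utf-8', 'surrogatepass')
        else:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode('utf-8')
    return bytes(out)


data = cesu8_encode('\U0010FF01')

# A strict UTF-8 decoder refuses the six-byte form.
try:
    data.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

Here data is the same six bytes shown above and strict_ok ends up False,
so a conforming decoder will not accept CESU-8 output for anything outside
the BMP.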

-- 
Steven