Newbie question about text encoding
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sat Mar 7 10:40:35 EST 2015
Marko Rauhamaa wrote:
> That said, UTF-8 does suffer badly from its not being
> a bijective mapping.
Can you explain?
As far as I am aware, every code point has one and only one valid UTF-8
encoding, and every valid UTF-8 byte sequence decodes to one and only one
code point.
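A quick way to convince yourself, as a rough sketch assuming Python 3 (where the UTF-8 codec is strict about this): a code point round-trips through its encoding, and an "overlong" two-byte spelling of U+002F is rejected rather than decoded to the same character. The variable names here are just for illustration.

```python
# Python 3 sketch: U+002F ('/') has exactly one valid UTF-8 encoding.
shortest = '/'.encode('utf-8')
roundtrip = shortest.decode('utf-8')

# 0xC0 0xAF carries the same bits padded out to two bytes (an "overlong" form).
overlong = b'\xc0\xaf'
try:
    overlong.decode('utf-8')
    rejected = False
except UnicodeDecodeError:
    # A strict decoder refuses the alternative spelling.
    rejected = True
```

If the decoder accepted the overlong form, two different byte sequences would map to the same code point and the mapping would no longer be one-to-one.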
There are *invalid* UTF-8 encodings, such as CESU-8, which is sometimes
mislabelled as UTF-8 (Oracle, I'm looking at you.) It violates the rule
that valid UTF-8 encodings are the shortest possible.
E.g. code points in the supplementary planes (above U+FFFF) should be
encoded to four bytes using UTF-8:
py> u'\U0010FF01'.encode('utf-8') # U+10FF01
'\xf4\x8f\xbc\x81'
But in CESU-8, the code point is first encoded as a UTF-16 surrogate
pair:
py> u'\U0010FF01'.encode('utf-16be')
'\xdb\xff\xdf\x01'
then each surrogate is treated as a 16-bit code unit and individually
encoded to three bytes using UTF-8:
py> u'\udbff'.encode('utf-8')
'\xed\xaf\xbf'
py> u'\udf01'.encode('utf-8')
'\xed\xbc\x81'
giving six bytes in total:
'\xed\xaf\xbf\xed\xbc\x81'
This is not UTF-8! But some software mislabels it as UTF-8.
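The steps above can be put together as a minimal sketch, assuming Python 3. The helper name cesu8_encode is made up for illustration; it is not a standard library codec. Note that Python 3 refuses to encode lone surrogates as UTF-8 unless you pass the 'surrogatepass' error handler, so the sketch uses it for that one step.

```python
def cesu8_encode(text):
    """Encode text the CESU-8 way (hypothetical helper, not a stdlib codec)."""
    out = bytearray()
    for ch in text:
        if ord(ch) > 0xFFFF:
            # Supplementary code point: get its UTF-16 surrogate pair (4 bytes).
            units = ch.encode('utf-16-be')
            for i in (0, 2):
                surrogate = chr(int.from_bytes(units[i:i + 2], 'big'))
                # Encode each surrogate on its own as a three-byte sequence.
                out += surrogate.encode('utf-8', 'surrogatepass')
        else:
            # BMP code points come out the same as in real UTF-8.
            out += ch.encode('utf-8')
    return bytes(out)

data = cesu8_encode('\U0010FF01')

# A strict UTF-8 decoder should refuse the result.
try:
    data.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

Here data comes out as the six bytes shown above, and the strict decode fails, which is the point: the output only looks like UTF-8.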
--
Steven