Unicode utf-8 doesn't do back-and-forth?

Piet van Oostrum piet at cs.uu.nl
Wed Jul 10 06:07:27 EDT 2002


>>>>> loewis at informatik.hu-berlin.de (Martin v. Löwis) (MvL) writes:

MvL> Piet van Oostrum <piet at cs.uu.nl> writes:
>> I looked it up in the online edition of The Unicode Standard, Version 3.0.
>> Chapter 3, section 3.8 says:
>> 
>> Because every Unicode coded character sequence maps to a unique
>> sequence of code values in a given UTF, a reverse mapping can be
>> derived. Thus every UTF supports lossless roundtrip transcoding:
>> mapping from any Unicode coded character sequence S to a sequence
>> of code values and back will produce S again. To ensure that
>> round-trip transcoding is possible, a UTF mapping _must also_ map
>> invalid Unicode scalar values to unique code value sequences. These
>> invalid scalar values include FFFE, FFFF and unpaired surrogates.
MvL> [...]
>> Technical Reports tr27 and tr28 do not withdraw this.

MvL> Notice that this is a requirement on UTFs, not on implementations
MvL> of Unicode. So the UTF must define a mapping. UTF-8 happens to map
MvL> invalid code sequences to illegal byte sequences. Unicode 3.1 now
MvL> mandates that such sequences are flagged as an error.

So should the UTF-8 encoder raise an exception when it encounters an
unpaired surrogate, rather than generating an illegal UTF-8 sequence?
TR28 says:
    It is illegal to emit or interpret any ill-formed code unit sequence.
But emitting one is exactly what the Python UTF-8 encoder does. It gives
'\xa0\x80' as the encoding of u'\ud800', which is an ill-formed UTF-8
sequence in any sense of the term (i.e. even under the old specs).
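
For comparison, here is a minimal sketch of how the same input can be
probed. It assumes a current Python 3 interpreter, not the Python version
discussed above, and the helper names try_encode and is_well_formed_utf8
are made up for illustration. It checks whether the strict encoder refuses
an unpaired surrogate and whether the byte string quoted above decodes.

```python
# Assumption: run on a current Python 3 interpreter, where str is Unicode.
# The helper names below are hypothetical, chosen only for this sketch.

lone = "\ud800"  # an unpaired high surrogate


def try_encode(text, errors="strict"):
    """Return the UTF-8 bytes, or the exception object if encoding fails."""
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError as exc:
        return exc


def is_well_formed_utf8(data):
    """True if the strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Strict mode: the encoder flags the unpaired surrogate as an error.
strict_result = try_encode(lone)

# The "surrogatepass" error handler emits the three-byte form anyway.
passed_through = try_encode(lone, "surrogatepass")

print(type(strict_result).__name__)
print(passed_through)
print(is_well_formed_utf8(b"\xa0\x80"))      # the bytes quoted above
print(is_well_formed_utf8(passed_through))
```

The "surrogatepass" handler is an explicit opt-in, so round-tripping an
unpaired surrogate requires passing it to both encode and decode.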
-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum at hccnet.nl


