Unicode utf-8 doesn't do back-and-forth?

Martin v. Löwis loewis at informatik.hu-berlin.de
Tue Jul 9 11:22:12 EDT 2002


Piet van Oostrum <piet at cs.uu.nl> writes:

> I looked it up in The Online Edition of The Unicode Standard, Version 3.0
> In chapter 3, section 3.8 it said:
> 
>     Because every Unicode coded character sequence maps to a unique
>     sequence of code values in a given UTF, a reverse mapping can be
>     derived. Thus every UTF supports lossless roundtrip transcoding:
>     mapping from any Unicode coded character sequence S to a sequence
>     of code values and back will produce S again. To ensure that
>     round-trip transcoding is possible, a UTF mapping _must also_ map
>     invalid Unicode scalar values to unique code value sequences. These
>     invalid scalar values include FFFE, FFFF and unpaired surrogates.
[...]
> Technical Reports tr27 and tr28 do not withdraw this.

Notice that this is a requirement on UTFs, not on implementations
of Unicode. So the UTF must define a mapping. UTF-8 happens to map
invalid code sequences to illegal byte sequences. Unicode 3.1 now
mandates that such sequences be flagged as an error.

> I found a long discussion in the i18n-sig archives in which you also
> participated. One conclusion was that the Unicode standard contradicts
> itself in this area.

That may well be. If so, the implementation should choose the most
likely interpretation. Since the requirement to flag errors is a new
one, it is likely that this is intentional, and any text contradicting
this requirement is in error. This is the interpretation that Python
has chosen.
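
A minimal sketch of what this interpretation looks like in practice,
assuming a current Python 3 interpreter (the interpreter of 2002 may
have behaved differently in detail). The helper names are hypothetical
and exist only for illustration; the "surrogatepass" error handler is
an explicit opt-in offered by Python 3's codecs, not the default.

```python
# Assumption: Python 3. An unpaired high surrogate is not a valid
# Unicode scalar value, so the strict UTF-8 codec refuses it.
lone_surrogate = "\ud800"

# The three bytes that the UTF-8 bit layout would assign to U+D800.
surrogate_bytes = b"\xed\xa0\x80"


def strict_encode_rejected(text):
    """Return True if the strict UTF-8 codec refuses to encode text."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def strict_decode_rejected(data):
    """Return True if the strict UTF-8 codec refuses to decode data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Both directions are flagged as errors by default.
encode_rejected = strict_encode_rejected(lone_surrogate)
decode_rejected = strict_decode_rejected(surrogate_bytes)

# Opting in to "surrogatepass" restores the lossless round trip
# described in the quoted section 3.8 text.
passed_bytes = lone_surrogate.encode("utf-8", "surrogatepass")
round_trip = passed_bytes.decode("utf-8", "surrogatepass")
```

The strict default follows the newer requirement to flag errors, while
the opt-in handler keeps the older round-trip behaviour available to
callers who ask for it.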

Regards,
Martin



