Unicode utf-8 doesn't do back-and-forth?

Tue Jul 9 11:22:12 EDT 2002

Piet van Oostrum <piet at cs.uu.nl> writes:

> I looked it up in The Online Edition of The Unicode Standard, Version 3.0
> In chapter 3, section 3.8 it said:
> 
>     Because every Unicode coded character sequence maps to a unique
>     sequence of code values in a given UTF, a reverse mapping can be
>     derived. Thus every UTF supports lossless roundtrip transcoding:
>     mapping from any Unicode coded character sequence S to a sequence
>     of code values and back will produce S again. To ensure that
>     round-trip transcoding is possible, a UTF mapping _must also_ map
>     invalid Unicode scalar values to unique code value sequences. These
>     invalid scalar values include FFFE, FFFF and unpaired surrogates.
[...]
> Technical Reports tr27 and tr28 do not withdraw this.

Notice that this is a requirement on UTFs, not on implementations
of Unicode. So the UTF must define a mapping. UTF-8 happens to map
invalid scalar values to illegal byte sequences. Unicode 3.1 now
mandates that such sequences be flagged as an error.

> I found a long discussion in the i18n-sig archives in which you also
> participated. One conclusion was that the Unicode standard contradicts
> itself in this area.

That may well be. If so, the implementation should choose the most
likely interpretation. Since the requirement to flag errors is a new
one, it is likely that this is intentional, and any text contradicting
this requirement is in error. This is the interpretation that Python
has chosen.
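
As a rough illustration of that interpretation, here is a minimal sketch.
It assumes a current Python 3 interpreter, whose behaviour may differ from
the interpreter under discussion in this thread. The helper names are
made up for the example. The strict UTF-8 codec rejects an unpaired
surrogate in both directions, while the optional "surrogatepass" error
handler gives the lossless round trip that the quoted section 3.8
describes.

```python
# Sketch only: behaviour of a current Python 3 interpreter (assumption),
# not necessarily that of the 2002 interpreter discussed above.

LONE_SURROGATE = "\ud800"   # an unpaired high surrogate
RAW_BYTES = b"\xed\xa0\x80" # the 3-byte pattern the UTF-8 mapping gives U+D800


def strict_encode_fails(text):
    """Return True if the strict UTF-8 encoder flags `text` as an error."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def strict_decode_fails(data):
    """Return True if the strict UTF-8 decoder flags `data` as an error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def roundtrip_with_surrogatepass(text):
    """Encode and decode while explicitly allowing surrogate code points."""
    encoded = text.encode("utf-8", "surrogatepass")
    return encoded.decode("utf-8", "surrogatepass")


if __name__ == "__main__":
    print(strict_encode_fails(LONE_SURROGATE))
    print(strict_decode_fails(RAW_BYTES))
    print(roundtrip_with_surrogatepass(LONE_SURROGATE) == LONE_SURROGATE)
```

The mapping itself is still defined, as the UTF requirement demands. The
strict codec simply refuses to apply it unless asked to.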

Regards,
Martin