Unicode utf-8 doesn't do back-and-forth?

Piet van Oostrum piet at cs.uu.nl
Tue Jul 9 09:31:29 EDT 2002


>>>>> loewis at informatik.hu-berlin.de (Martin v. Löwis) (MvL) writes:

MvL> Piet van Oostrum <piet at cs.uu.nl> writes:
>> Well, I looked into the Unicode specs and it says that even if single
>> surrogates appear in a string, the UTF-8 encoding should generate a valid
>> UTF-8 byte sequence, which on decoding should give the same surrogate. So
>> I would say this is a bug in the UTF-8 encoding.

MvL> Which Unicode specs did you look at? Unicode TR #28 (aka Unicode 3.2),

I looked it up in the online edition of The Unicode Standard, Version 3.0.
Chapter 3, section 3.8 says:

    Because every Unicode coded character sequence maps to a unique
    sequence of code values in a given UTF, a reverse mapping can be
    derived. Thus every UTF supports lossless roundtrip transcoding:
    mapping from any Unicode coded character sequence S to a sequence
    of code values and back will produce S again. To ensure that
    round-trip transcoding is possible, a UTF mapping _must also_ map
    invalid Unicode scalar values to unique code value sequences. These
    invalid scalar values include FFFE, FFFF and unpaired surrogates.

This sentence is also in the FAQ:
http://www.unicode.org/unicode/faq/utf_bom.html

Technical Reports tr27 and tr28 do not withdraw this.
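
For illustration, here is a minimal sketch of the round trip in question,
run on a present-day Python 3 interpreter (an assumption; this is not the
Python version under discussion in this thread). The strict UTF-8 codec
refuses an unpaired surrogate, while the 'surrogatepass' error handler gives
the lossless round trip that the quoted passage describes:

```python
lone = "\ud800"  # an unpaired high surrogate

# The default (strict) UTF-8 codec in Python 3 refuses to encode it.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With the 'surrogatepass' error handler the surrogate is encoded as an
# ordinary three-byte sequence and decodes back to the same code point.
raw = lone.encode("utf-8", "surrogatepass")
back = raw.decode("utf-8", "surrogatepass")

print(strict_ok, raw, back == lone)
```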


MvL> http://www.unicode.org/unicode/reports/tr28/

MvL> says

MvL> <quote>
MvL> The definition of transformation formats such as UTF-8 allowed
MvL> conformant processes to interpret certain sequences called irregular
MvL> sequences. These irregular sequences are those that would be produced
MvL> by transforming supplementary code points as if they were a sequence
MvL> of two surrogate code points.

MvL> To tighten the definitions, in Unicode 3.2 such irregular sequences
MvL> are now illegal.
MvL> </quote>

I think the above applies to something different, namely the case where a
surrogate pair encodes a legal Unicode character and the two surrogates are
transformed to UTF-8 independently rather than as a single character.
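
To make the distinction concrete, the sketch below (Python 3 assumed; the
variable names are only illustrative) encodes U+10000 once as a single
character and once as its two surrogates transformed separately. The second
result is the six-byte "irregular sequence" that the TR #28 quote refers to:

```python
ch = "\U00010000"  # first supplementary code point

# Regular UTF-8: one four-byte sequence for the whole character.
regular = ch.encode("utf-8")

# Split the code point into its UTF-16 surrogate pair by hand.
v = ord(ch) - 0x10000
high = chr(0xD800 + (v >> 10))
low = chr(0xDC00 + (v & 0x3FF))

# Irregular form: each surrogate transformed to UTF-8 on its own.
# 'surrogatepass' is needed because the strict codec rejects surrogates.
irregular = (high + low).encode("utf-8", "surrogatepass")

# A strict decoder treats the irregular form as illegal.
try:
    irregular.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False

print(regular.hex(), irregular.hex(), accepted)
```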

MvL> Table 3.1B of the same document explicitly lists the byte sequences
MvL> that would denote code points D800-DFFF as illegal.

You are right.
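
As a rough check of what that table implies, the following sketch (Python 3
assumed; three_byte_form is a hypothetical helper, not a library function)
builds the three-byte pattern for every code point in the full surrogate
range, U+D800 through U+DFFF, and counts how many a strict UTF-8 decoder
rejects:

```python
def three_byte_form(cp):
    """Build the three-byte UTF-8 bit pattern for a code point in 0800-FFFF."""
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

rejected = 0
for cp in range(0xD800, 0xE000):
    try:
        three_byte_form(cp).decode("utf-8")
    except UnicodeDecodeError:
        rejected += 1

# Every surrogate starts with lead byte ED followed by a byte in A0..BF.
print(rejected, three_byte_form(0xD800).hex(), three_byte_form(0xDFFF).hex())
```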

I found a long discussion in the i18n-sig archives in which you also
participated. One conclusion was that the Unicode standard contradicts
itself in this area.

-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum at hccnet.nl


