Unicode utf-8 doesn't do back-and-forth?
Piet van Oostrum
piet at cs.uu.nl
Tue Jul 9 09:31:29 EDT 2002
>>>>> loewis at informatik.hu-berlin.de (Martin v. Löwis) (MvL) writes:
MvL> Piet van Oostrum <piet at cs.uu.nl> writes:
>> Well, I looked into the Unicode specs and it says that even if single
>> surrogates appear in a string, the UTF-8 encoding should generate a valid
>> UTF-8 byte sequence, which on decoding should give the same surrogate. So
>> I would say this is a bug in the UTF-8 encoding.
MvL> Which Unicode specs did you look at? Unicode TR #28 (aka Unicode 3.2),
I looked it up in the online edition of The Unicode Standard, Version 3.0.
In chapter 3, section 3.8, it says:
    Because every Unicode coded character sequence maps to a unique
    sequence of code values in a given UTF, a reverse mapping can be
    derived. Thus every UTF supports lossless roundtrip transcoding:
    mapping from any Unicode coded character sequence S to a sequence
    of code values and back will produce S again. To ensure that
    round-trip transcoding is possible, a UTF mapping _must also_ map
    invalid Unicode scalar values to unique code value sequences. These
    invalid scalar values include FFFE, FFFF and unpaired surrogates.
This sentence is also in the FAQ:
http://www.unicode.org/unicode/faq/utf_bom.html
Technical Reports tr27 and tr28 do not withdraw this.
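To make the round-trip question concrete, here is a minimal sketch. It assumes a current Python 3 interpreter, not the 2002 one this thread is about, and the helper names are made up for illustration. The strict "utf-8" codec refuses an unpaired surrogate, while the "surrogatepass" error handler gives the lossless round trip described in section 3.8.

```python
# Assumption: Python 3 semantics; helper names are illustrative only.
lone = "\ud800"  # an unpaired high surrogate


def strict_roundtrip(s):
    """Round-trip through strict UTF-8; return the error name on failure."""
    try:
        return s.encode("utf-8").decode("utf-8")
    except UnicodeError as exc:
        return type(exc).__name__


def lenient_roundtrip(s):
    """Round-trip while letting surrogate code points pass through."""
    data = s.encode("utf-8", "surrogatepass")
    return data.decode("utf-8", "surrogatepass")


strict_result = strict_roundtrip(lone)    # strict codec rejects the surrogate
lenient_result = lenient_roundtrip(lone)  # same code point comes back
encoded = lone.encode("utf-8", "surrogatepass")  # three-byte form ED A0 80
```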
MvL> http://www.unicode.org/unicode/reports/tr28/
MvL> says
MvL> <quote>
MvL> The definition of transformation formats such as UTF-8 allowed
MvL> conformant processes to interpret certain sequences called irregular
MvL> sequences. These irregular sequences are those that would be produced
MvL> by transforming supplementary code points as if they were a sequence
MvL> of two surrogate code points.
MvL> To tighten the definitions, in Unicode 3.2 such irregular sequences
MvL> are now illegal.
MvL> </quote>
I think the above applies to something different: the case where a
surrogate pair encodes a legal Unicode character and each surrogate is
transformed to UTF-8 independently, rather than the pair being transformed
as a single character.
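As a hedged illustration of that distinction, assuming Python 3 and using U+10000 as an arbitrary supplementary character: encoding the character as a whole gives four bytes, while encoding its two surrogates one by one gives the six-byte "irregular" sequence that TR #28 talks about.

```python
# Assumption: Python 3; U+10000 is just an example supplementary character.
ch = "\U00010000"

# Regular form: the code point is transformed as a single character.
regular = ch.encode("utf-8")

# Split the character into its UTF-16 surrogate pair (D800, DC00).
utf16 = ch.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

# Irregular form: each surrogate is transformed to UTF-8 on its own.
irregular = b"".join(
    chr(unit).encode("utf-8", "surrogatepass") for unit in (high, low)
)

# A strict decoder refuses the irregular form.
try:
    irregular.decode("utf-8")
    strict_accepts_irregular = True
except UnicodeDecodeError:
    strict_accepts_irregular = False
```

Here `regular` is F0 90 80 80 and `irregular` is ED A0 80 ED B0 80; only the first is accepted by the strict decoder.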
MvL> Table 3.1B of the same document explicitly lists the byte sequences
MvL> that would denote code points D800-DFFF as illegal.
You are right.
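A rough way to see which three-byte sequences a strict decoder treats as illegal, again assuming Python 3 rather than the interpreter of this thread. The function utf8_three_byte is a hypothetical helper that applies the generic three-byte bit pattern without any validity check; the surrogate range is D800-DFFF.

```python
# Assumption: Python 3's strict "utf-8" decoder; helper names are illustrative.
def utf8_three_byte(cp):
    """Apply the generic 3-byte UTF-8 bit pattern to cp (0x0800..0xFFFF)."""
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def strict_decodes(data):
    """Return True if the strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Scan D000..DFFF and collect the code points whose encoding is rejected.
rejected = [
    cp for cp in range(0xD000, 0xE000)
    if not strict_decodes(utf8_three_byte(cp))
]

# Every rejected sequence starts with ED and has a second byte in A0..BF.
second_bytes = {utf8_three_byte(cp)[1] for cp in rejected}
```

The rejected code points turn out to be exactly the surrogate range, so ED may only be followed by 80..9F in a sequence this decoder accepts.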
I found a long discussion in the i18n-sig archives in which you also
participated. One conclusion was that the Unicode standard contradicts
itself in this area.
--
Piet van Oostrum <piet at cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum at hccnet.nl